PISA 2012 Data Exploration

by Anna Pedroni

From the PISA Data Visualization Contest webpage:

PISA is a worldwide study developed by the Organisation for Economic Co-operation and Development (OECD) which examines the skills of 15-year-old school students around the world. The study assesses students’ mathematics, science, and reading skills and contains a wealth of information on students’ background, their school and the organisation of education systems. For most countries, the sample is around 5,000 students, but in some countries the number is even higher. In total, the PISA 2012 dataset contains data on 485 490 pupils.

A detailed description of the methodology of the PISA surveys can be found in the PISA 2012 Technical Report.

A few points of interest.

PISA:

  • is an age-based survey, assessing 15-year-old students in school in grade 7 or higher. These students are approaching the end of compulsory schooling in most participating countries, and school enrolment at this level is close to universal in almost all OECD countries;
  • takes a literacy perspective, which focuses on the extent to which students can apply the knowledge and skills they have learned and practised at school when confronted with situations and challenges for which that knowledge may be relevant;
  • allows for the assessment of additional cross-curricular competencies [...]. For 2012 a computer-delivered assessment of mathematics and problem solving was added, along with an assessment of financial literacy;
  • uses Student Questionnaires to collect information from students on various aspects of their home, family and school background;
  • uses School Questionnaires to collect information from schools about various aspects of organisation and educational provision in schools;
  • uses Parent Questionnaires administered to the parents of the students participating in PISA (in 11 countries for the 2012 survey);
  • administers its tests in the language of instruction of mathematics.

Technical report, p.67:

Students whose language of instruction for mathematics (the major domain for 2012) was one for which no PISA assessment materials were available. Standard 2.1 of the PISA 2012 Technical Standards (see Annex F) notes that the PISA test is administered to a student in a language of instruction provided by the sampled school to that sampled student in the major domain of the test. Thus, if no test materials were available in the language in which the sampled student is taught, the student was excluded.

Focus and Participation

PISA 2012, the fifth PISA survey, covered reading, mathematics, science, problem solving and financial literacy, with a primary focus on mathematics.

It was conducted in 34 OECD countries and 31 partner countries/economies. All 65 countries/economies completed the paper-based tests, with assessments lasting a total of two hours for each student.

An additional 40 minutes were devoted to the computer-based assessment of

  • problem solving, in 44 countries/economies;
  • mathematics and reading, in 32 countries/economies;
  • financial literacy, in 18 countries/economies.

The full list of participants can be found here.

Which countries/economies took part in the additional computer-based assessments is listed in the Technical Report, pp. 23-24.

According to the PISA Technical Report, analyses linking the contextual information from the Student, Parent and School Questionnaires with student achievement could address:

  • differences between countries in the relationships between student-level factors (such as gender and socio-economic background) and achievement;
  • differences in the relationships between school-level factors and achievement across countries;
  • differences in the proportion of variation in achievement between (rather than within) schools, and differences in this value across countries;
  • differences between countries in the extent to which schools moderate or increase the effects of individual-level student factors on student achievement;
  • differences in education systems and national context that are related to differences in student achievement across countries; and
  • through links to PISA 2000, PISA 2003, PISA 2006 and PISA 2009, changes in any or all of these relationships over time.
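The third bullet, the share of achievement variance that lies between rather than within schools, can be illustrated with a toy variance decomposition. The scores below are made up for the sketch; the actual PISA analyses use multilevel models together with sampling weights and plausible values:

```python
import pandas as pd

# Made-up scores for six students in three schools (A, B, C).
df = pd.DataFrame({
    'school': ['A', 'A', 'B', 'B', 'C', 'C'],
    'score':  [500., 520., 430., 450., 560., 580.],
})

grand_mean = df['score'].mean()
# per-row mean of the student's own school
school_means = df.groupby('school')['score'].transform('mean')

# decompose total sum of squares into between- and within-school parts
between = ((school_means - grand_mean) ** 2).sum()
within = ((df['score'] - school_means) ** 2).sum()
share = between / (between + within)
print(round(share, 3))  # → 0.966: most variation is between schools here
```

In these toy data the schools differ far more than the students within each school, so the between-school share is close to 1; across real education systems this proportion varies widely, which is exactly what the bullet above refers to.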

Preliminary Wrangling

Links to two files were provided with the Udacity description of the databases for the project:

  • PISA Data: pisa2012.csv (.zip) - file with the data about the 485 490 pupils
  • PISA Data Dictionary: pisadict2012.csv - names of the columns and their descriptions

Download the data:

In [1]:
# import packages to download files and manage folders
import os
import requests
import zipfile
In [2]:
# create a folder and get the files

folder_name = 'PISA_data'

if not os.path.exists(folder_name):
    os.makedirs(folder_name)

urls = ['https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip',
        'https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisadict2012.csv']
         
for url in urls:
    response = requests.get(url)
    response.raise_for_status()  # fail fast if a download did not succeed
    file_name = url.split('/')[-1]
    with open(os.path.join(folder_name, file_name), mode='wb') as file:
        file.write(response.content)
In [3]:
# unzip the PISA data 

file_name = 'pisa2012.csv.zip'

with zipfile.ZipFile(os.path.join(folder_name, file_name)) as data_zip:
    data_zip.extractall(folder_name)
print('Unzipped')

# remove the .zip file
os.remove(os.path.join(folder_name, file_name))
print('Removed')
Unzipped
Removed
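The extract-then-delete pattern used above can be exercised without the large download by round-tripping a tiny in-memory archive (the file names here are made up for the demonstration):

```python
import io
import os
import tempfile
import zipfile

# Build a tiny zip archive in memory containing one fake CSV.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('demo.csv', 'CNT,PV1MATH\nITA,485.0\n')

with tempfile.TemporaryDirectory() as folder:
    # write the archive to disk, mimicking the downloaded .zip
    zip_path = os.path.join(folder, 'demo.csv.zip')
    with open(zip_path, 'wb') as f:
        f.write(buf.getvalue())

    # same steps as above: extract everything, then remove the archive
    with zipfile.ZipFile(zip_path) as data_zip:
        data_zip.extractall(folder)
    os.remove(zip_path)

    extracted = sorted(os.listdir(folder))

print(extracted)  # → ['demo.csv']: only the extracted file remains
```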

Import the data and have a look

In [4]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

The dictionary of variables

In [5]:
# load the data and have a look
pisa_variables = pd.read_csv(os.path.join(folder_name, 'pisadict2012.csv'), encoding='latin-1', dtype='unicode')
pisa_variables.head()
Out[5]:
Unnamed: 0 x
0 CNT Country code 3-character
1 SUBNATIO Adjudicated sub-region code 7-digit code (3-di...
2 STRATUM Stratum ID 7-character (cnt + region ID + orig...
3 OECD OECD country
4 NC National Centre 6-digit Code
In [6]:
pisa_variables.shape
Out[6]:
(635, 2)
In [7]:
pisa_variables.columns = ['code', 'x']
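With the columns renamed, the dictionary can be searched by keyword. A sketch on a three-row stand-in frame (only CNT and the descriptions appear in the preview above; the short codes ST04Q01 and PV1MATH are assumptions about the codebook) — the same `str.contains` call would apply to the full `pisa_variables`:

```python
import pandas as pd

# Toy stand-in for the dictionary: three code/description pairs.
# Codes other than CNT are illustrative assumptions.
toy_vars = pd.DataFrame({
    'code': ['CNT', 'ST04Q01', 'PV1MATH'],
    'x': ['Country code 3-character', 'Gender',
          'Plausible value 1 in mathematics'],
})

# select every variable whose description mentions a keyword
hits = toy_vars[toy_vars['x'].str.contains('math', case=False)]
print(hits['code'].tolist())  # → ['PV1MATH']
```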
In [8]:
pisa_variables['x'].values
Out[8]:
array(['Country code 3-character',
       'Adjudicated sub-region code 7-digit code (3-digit country code + region ID + stratum ID)',
       'Stratum ID 7-character (cnt + region ID + original stratum ID)',
       'OECD country', 'National Centre 6-digit Code',
       'School ID 7-digit (region ID + stratum ID + 3-digit school ID)',
       'Student ID', 'International Grade', 'National Study Programme',
       'Birth - Month', 'Birth -Year', 'Gender', 'Attend <ISCED 0>',
       'Age at <ISCED 1>', 'Repeat - <ISCED 1>', 'Repeat - <ISCED 2>',
       'Repeat - <ISCED 3>', 'Truancy - Late for School',
       'Truancy - Skip whole school day',
       'Truancy - Skip classes within school day', 'At Home - Mother',
       'At Home - Father', 'At Home - Brothers', 'At Home - Sisters',
       'At Home - Grandparents', 'At Home - Others',
       'Mother<Highest Schooling>',
       'Mother Qualifications - <ISCED level 6>',
       'Mother Qualifications - <ISCED level 5A>',
       'Mother Qualifications - <ISCED level 5B>',
       'Mother Qualifications - <ISCED level 4>',
       'Mother Current Job Status', 'Father<Highest Schooling>',
       'Father Qualifications - <ISCED level 6>',
       'Father Qualifications - <ISCED level 5A>',
       'Father Qualifications - <ISCED level 5B>',
       'Father Qualifications - <ISCED level 4>',
       'Father Current Job Status',
       'Country of Birth International - Self',
       'Country of Birth International - Mother',
       'Country of Birth International - Father',
       'Age of arrival in <country of test>',
       'International Language at Home', 'Possessions - desk',
       'Possessions - own room', 'Possessions - study place',
       'Possessions - computer', 'Possessions - software',
       'Possessions - Internet', 'Possessions - literature',
       'Possessions - poetry', 'Possessions - art',
       'Possessions - textbooks',
       'Possessions - <technical reference books>',
       'Possessions - dictionary', 'Possessions - dishwasher',
       'Possessions - <DVD>', 'Possessions - <Country item 1>',
       'Possessions - <Country item 2>', 'Possessions - <Country item 3>',
       'How many - cellular phones', 'How many - televisions',
       'How many - computers', 'How many - cars',
       'How many - rooms bath or shower', 'How many books at home',
       'Math Interest - Enjoy Reading',
       'Instrumental Motivation - Worthwhile for Work',
       'Math Interest - Look Forward to Lessons',
       'Math Interest - Enjoy Maths',
       'Instrumental Motivation - Worthwhile for Career Chances',
       'Math Interest - Interested',
       'Instrumental Motivation - Important for Future Study',
       'Instrumental Motivation - Helps to Get a Job',
       'Subjective Norms -Friends Do Well in Mathematics',
       'Subjective Norms -Friends Work Hard on Mathematics',
       'Subjective Norms - Friends Enjoy Mathematics Tests',
       'Subjective Norms - Parents Believe Studying Mathematics Is Important',
       'Subjective Norms - Parents Believe Mathematics Is Important for Career',
       'Subjective Norms - Parents Like Mathematics',
       'Math Self-Efficacy - Using a <Train Timetable>',
       'Math Self-Efficacy - Calculating TV Discount',
       'Math Self-Efficacy - Calculating Square Metres of Tiles',
       'Math Self-Efficacy - Understanding Graphs in Newspapers',
       'Math Self-Efficacy - Solving Equation 1',
       'Math Self-Efficacy - Distance to Scale',
       'Math Self-Efficacy - Solving Equation 2',
       'Math Self-Efficacy - Calculate Petrol Consumption Rate',
       'Math Anxiety - Worry That It Will Be Difficult',
       'Math Self-Concept - Not Good at Maths',
       'Math Anxiety - Get Very Tense',
       'Math Self-Concept- Get Good <Grades>',
       'Math Anxiety - Get Very Nervous',
       'Math Self-Concept - Learn Quickly',
       'Math Self-Concept - One of Best Subjects',
       'Math Anxiety - Feel Helpless',
       'Math Self-Concept - Understand Difficult Work',
       'Math Anxiety - Worry About Getting Poor <Grades>',
       'Perceived Control - Can Succeed with Enough Effort',
       'Perceived Control - Doing Well is Completely Up to Me',
       'Perceived Control - Family Demands and Problems',
       'Perceived Control - Different Teachers',
       'Perceived Control - If I Wanted I Could Perform Well',
       'Perceived Control - Perform Poorly Regardless',
       'Attributions to Failure - Not Good at Maths Problems',
       'Attributions to Failure - Teacher Did Not Explain Well',
       'Attributions to Failure - Bad Guesses',
       'Attributions to Failure - Material Too Hard',
       'Attributions to Failure - Teacher Didnt Get Students Interested',
       'Attributions to Failure - Unlucky',
       'Math Work Ethic - Homework Completed in Time',
       'Math Work Ethic - Work Hard on Homework',
       'Math Work Ethic - Prepared for Exams',
       'Math Work Ethic - Study Hard for Quizzes',
       'Math Work Ethic - Study Until I Understand Everything',
       'Math Work Ethic - Pay Attention in Classes',
       'Math Work Ethic - Listen in Classes',
       'Math Work Ethic - Avoid Distractions When Studying',
       'Math Work Ethic - Keep Work Organized',
       'Math Intentions - Mathematics vs. Language Courses After School',
       'Math Intentions - Mathematics vs. Science Related Major in College',
       'Math Intentions - Study Harder in Mathematics vs. Language Classes',
       'Math Intentions - Take Maximum Number of Mathematics vs. Science Classes',
       'Math Intentions - Pursuing a Career That Involves Mathematics vs. Science',
       'Math Behaviour - Talk about Maths with Friends',
       'Math Behaviour - Help Friends with Maths',
       'Math Behaviour - <Extracurricular> Activity',
       'Math Behaviour - Participate in Competitions',
       'Math Behaviour - Study More Than 2 Extra Hours a Day',
       'Math Behaviour - Play Chess',
       'Math Behaviour - Computer programming',
       'Math Behaviour - Participate in Math Club',
       'Learning Strategies- Important Parts vs. Existing Knowledge vs. Learn by Heart',
       'Learning Strategies- Improve Understanding vs. New Ways vs. Memory',
       'Learning Strategies - Other Subjects vs. Learning Goals vs. Rehearse Problems',
       'Learning Strategies - Repeat Examples vs. Everyday Applications vs. More Information',
       'Out of school lessons - <test lang>',
       'Out of school lessons - <maths>',
       'Out of school lessons - <science>',
       'Out of school lessons - other',
       'Out-of-School Study Time - Homework',
       'Out-of-School Study Time - Guided Homework',
       'Out-of-School Study Time - Personal Tutor',
       'Out-of-School Study Time - Commercial Company',
       'Out-of-School Study Time - With Parent',
       'Out-of-School Study Time - Computer',
       'Experience with Applied Maths Tasks - Use <Train Timetable>',
       'Experience with Applied Maths Tasks - Calculate Price including Tax',
       'Experience with Applied Maths Tasks - Calculate Square Metres',
       'Experience with Applied Maths Tasks - Understand Scientific Tables',
       'Experience with Pure Maths Tasks - Solve Equation 1',
       'Experience with Applied Maths Tasks - Use a Map to Calculate Distance',
       'Experience with Pure Maths Tasks - Solve Equation 2',
       'Experience with Applied Maths Tasks - Calculate Power Consumption Rate',
       'Experience with Applied Maths Tasks - Solve Equation 3',
       'Familiarity with Math Concepts - Exponential Function',
       'Familiarity with Math Concepts - Divisor',
       'Familiarity with Math Concepts - Quadratic Function',
       'Overclaiming - Proper Number',
       'Familiarity with Math Concepts - Linear Equation',
       'Familiarity with Math Concepts - Vectors',
       'Familiarity with Math Concepts - Complex Number',
       'Familiarity with Math Concepts - Rational Number',
       'Familiarity with Math Concepts - Radicals',
       'Overclaiming - Subjunctive Scaling',
       'Familiarity with Math Concepts - Polygon',
       'Overclaiming - Declarative Fraction',
       'Familiarity with Math Concepts - Congruent Figure',
       'Familiarity with Math Concepts - Cosine',
       'Familiarity with Math Concepts - Arithmetic Mean',
       'Familiarity with Math Concepts - Probability',
       'Min in <class period> - <test lang>',
       'Min in <class period> - <Maths>',
       'Min in <class period> - <Science>',
       'No of <class period> p/wk - <test lang>',
       'No of <class period> p/wk - <Maths>',
       'No of <class period> p/wk - <Science>',
       'No of ALL <class period> a week',
       'Class Size - No of Students in <Test Language> Class',
       'OTL - Algebraic Word Problem in Math Lesson',
       'OTL - Algebraic Word Problem in Tests',
       'OTL - Procedural Task in Math Lesson',
       'OTL - Procedural Task in Tests',
       'OTL - Pure Math Reasoning in Math Lesson',
       'OTL - Pure Math Reasoning in Tests',
       'OTL - Applied Math Reasoning in Math Lesson',
       'OTL - Applied Math Reasoning in Tests',
       'Math Teaching - Teacher shows interest',
       'Math Teaching - Extra help', 'Math Teaching - Teacher helps',
       'Math Teaching - Teacher continues',
       'Math Teaching - Express opinions',
       'Teacher-Directed Instruction - Sets Clear Goals',
       'Teacher-Directed Instruction - Encourages Thinking and Reasoning',
       'Student Orientation - Differentiates Between Students When Giving Tasks',
       'Student Orientation - Assigns Complex Projects',
       'Formative Assessment - Gives Feedback',
       'Teacher-Directed Instruction - Checks Understanding',
       'Student Orientation - Has Students Work in Small Groups',
       'Teacher-Directed Instruction - Summarizes Previous Lessons',
       'Student Orientation - Plans Classroom Activities',
       'Formative Assessment - Gives Feedback on Strengths and Weaknesses',
       'Formative Assessment - Informs about Expectations',
       'Teacher-Directed Instruction - Informs about Learning Goals',
       'Formative Assessment - Tells How to Get Better',
       'Cognitive Activation - Teacher Encourages to Reflect Problems',
       'Cognitive Activation - Gives Problems that Require to Think',
       'Cognitive Activation - Asks to Use Own Procedures',
       'Cognitive Activation - Presents Problems with No Obvious Solutions',
       'Cognitive Activation - Presents Problems in Different Contexts',
       'Cognitive Activation - Helps Learn from Mistakes',
       'Cognitive Activation - Asks for Explanations',
       'Cognitive Activation - Apply What We Learned',
       'Cognitive Activation - Problems with Multiple Solutions',
       'Disciplinary Climate - Students Don\x92t Listen',
       'Disciplinary Climate - Noise and Disorder',
       'Disciplinary Climate - Teacher Has to Wait Until its Quiet',
       'Disciplinary Climate - Students Don\x92t Work Well',
       'Disciplinary Climate - Students Start Working Late',
       'Vignette Teacher Support -Homework Every Other Day/Back in Time',
       'Vignette Teacher Support - Homework Once a Week/Back in Time',
       'Vignette Teacher Support - Homework Once a Week/Not Back in Time',
       'Teacher Support - Lets Us Know We Have to Work Hard',
       'Teacher Support - Provides Extra Help When Needed',
       'Teacher Support - Helps Students with Learning',
       'Teacher Support - Gives Opportunity to Express Opinions',
       'Vignette Classroom Management - Students Frequently Interrupt/Teacher Arrives Early',
       'Vignette Classroom Management - Students Are Calm/Teacher Arrives on Time',
       'Vignette Classroom Management - Students Frequently Interrupt/Teacher Arrives Late',
       'Classroom Management - Students Listen',
       'Classroom Management - Teacher Keeps Class Orderly',
       'Classroom Management - Teacher Starts On Time',
       'Classroom Management - Wait Long to <Quiet Down>',
       'Student-Teacher Relation - Get Along with Teachers',
       'Student-Teacher Relation - Teachers Are Interested',
       'Student-Teacher Relation - Teachers Listen to Students',
       'Student-Teacher Relation - Teachers Help Students',
       'Student-Teacher Relation - Teachers Treat Students Fair',
       'Sense of Belonging - Feel Like Outsider',
       'Sense of Belonging - Make Friends Easily',
       'Sense of Belonging - Belong at School',
       'Sense of Belonging - Feel Awkward at School',
       'Sense of Belonging - Liked by Other Students',
       'Sense of Belonging - Feel Lonely at School',
       'Sense of Belonging - Feel Happy at School',
       'Sense of Belonging - Things Are Ideal at School',
       'Sense of Belonging - Satisfied at School',
       'Attitude towards School - Does Little to Prepare Me for Life',
       'Attitude towards School - Waste of Time',
       'Attitude towards School - Gave Me Confidence',
       'Attitude towards School- Useful for Job',
       'Attitude toward School - Helps to Get a Job',
       'Attitude toward School - Prepare for College',
       'Attitude toward School - Enjoy Good Grades',
       'Attitude toward School - Trying Hard is Important',
       'Perceived Control - Can Succeed with Enough Effort',
       'Perceived Control - My Choice Whether I Will Be Good',
       'Perceived Control - Problems Prevent from Putting Effort into School',
       'Perceived Control - Different Teachers Would Make Me Try Harder',
       'Perceived Control - Could Perform Well if I Wanted',
       'Perceived Control - Perform Poor Regardless',
       'Perseverance - Give up easily',
       'Perseverance - Put off difficult problems',
       'Perseverance - Remain interested',
       'Perseverance - Continue to perfection',
       'Perseverance - Exceed expectations',
       'Openness for Problem Solving - Can Handle a Lot of Information',
       'Openness for Problem Solving - Quick to Understand',
       'Openness for Problem Solving - Seek Explanations',
       'Openness for Problem Solving - Can Link Facts',
       'Openness for Problem Solving - Like to Solve Complex Problems',
       'Problem Text Message - Press every button',
       'Problem Text Message - Trace steps',
       'Problem Text Message - Manual',
       'Problem Text Message - Ask a friend',
       'Problem Route Selection - Read brochure',
       'Problem Route Selection - Study map',
       'Problem Route Selection - Leave it to brother',
       'Problem Route Selection - Just drive',
       'Problem Ticket Machine - Similarities',
       'Problem Ticket Machine - Try buttons',
       'Problem Ticket Machine - Ask for help',
       'Problem Ticket Machine - Find ticket office',
       'At Home - Desktop Computer', 'At Home - Portable laptop',
       'At Home - Tablet computer', 'At Home - Internet connection',
       'At Home - Video games console',
       'At Home - Cell phone w/o Internet',
       'At Home - Cell phone with Internet', 'At Home - Mp3/Mp4 player',
       'At Home - Printer', 'At Home - USB (memory) stick',
       'At Home - Ebook reader', 'At school - Desktop Computer',
       'At school - Portable laptop', 'At school - Tablet computer',
       'At school - Internet connection', 'At school - Printer',
       'At school - USB (memory) stick', 'At school - Ebook reader',
       'First use of computers', 'First access to Internet',
       'Internet at School', 'Internet out-of-school - Weekday',
       'Internet out-of-school - Weekend',
       'Out-of-school 8 - One player games.',
       'Out-of-school 8 - ColLabourative games.',
       'Out-of-school 8 - Use email', 'Out-of-school 8 - Chat on line',
       'Out-of-school 8 - Social networks',
       'Out-of-school 8 - Browse the Internet for fun',
       'Out-of-school 8 - Read news',
       'Out-of-school 8 - Obtain practical information from the Internet',
       'Out-of-school 8 - Download music',
       'Out-of-school 8 - Upload content',
       'Out-of-school 9 - Internet for school',
       'Out-of-school 9 - Email students',
       'Out-of-school 9 - Email teachers',
       'Out-of-school 9 - Download from School',
       'Out-of-school 9 - Announcements', 'Out-of-school 9 - Homework',
       'Out-of-school 9 - Share school material',
       'At School - Chat on line', 'At School - Email',
       'At School - Browse for schoolwork',
       'At School - Download from website', 'At School - Post on website',
       'At School - Simulations', 'At School - Practice and drilling',
       'At School - Homework', 'At School - Group work',
       'Maths lessons - Draw graph',
       'Maths lessons - Calculation with numbers',
       'Maths lessons - Geometric figures', 'Maths lessons - Spreadsheet',
       'Maths lessons - Algebra', 'Maths lessons - Histograms',
       'Maths lessons - Change in graphs',
       'Attitudes - Useful for schoolwork',
       'Attitudes - Homework more fun',
       'Attitudes - Source of information', 'Attitudes - Troublesome',
       'Attitudes - Not suitable for schoolwork',
       'Attitudes - Too unreliable', 'Miss 2 months of <ISCED 1>',
       'Miss 2 months of <ISCED 2>', 'Future Orientation - Internship',
       'Future Orientation - Work-site visits',
       'Future Orientation - Job fair',
       'Future Orientation - Career advisor at school',
       'Future Orientation - Career advisor outside school',
       'Future Orientation - Questionnaire',
       'Future Orientation - Internet search',
       'Future Orientation - Tour<ISCED 3-5> institution',
       'Future Orientation - web search <ISCED 3-5> prog',
       'Future Orientation - <country specific item>',
       'Acquired skills - Find job info - Yes, at school',
       'Acquired skills - Find job info - Yes, out of school',
       'Acquired skills - Find job info - No, never',
       'Acquired skills - Search for job - Yes, at school',
       'Acquired skills - Search for job - Yes, out of school',
       'Acquired skills - Search for job - No, never',
       'Acquired skills - Write resume - Yes, at school',
       'Acquired skills - Write resume - Yes, out of school',
       'Acquired skills - Write resume - No, never',
       'Acquired skills - Job interview - Yes, at school',
       'Acquired skills - Job interview - Yes, out of school',
       'Acquired skills - Job interview - No, never',
       'Acquired skills - ISCED 3-5 programs - Yes, at school',
       'Acquired skills - ISCED 3-5 programs - Yes, out of school',
       'Acquired skills - ISCED 3-5 programs - No, never',
       'Acquired skills - Student financing - Yes, at school',
       'Acquired skills - Student financing - Yes, out of school',
       'Acquired skills - Student financing - No, never',
       'First language learned', 'Age started learning <test language>',
       'Language spoken - Mother', 'Language spoken - Father',
       'Language spoken - Siblings', 'Language spoken - Best friend',
       'Language spoken - Schoolmates', 'Activities language - Reading',
       'Activities language - Watching TV',
       'Activities language - Internet surfing',
       'Activities language - Writing emails',
       'Types of support <test language> - remedial lessons',
       'Amount of support <test language>',
       'Attend lessons <heritage language> - focused',
       'Attend lessons <heritage language> - school subjects',
       'Instruction in <heritage language>',
       'Acculturation - Mother Immigrant (Filter)',
       'Acculturation - Enjoy <Host Culture> Friends',
       'Acculturation - Enjoy <Heritage Culture> Friends',
       'Acculturation - Enjoy <Host Culture> Celebrations',
       'Acculturation - Enjoy <Heritage Culture> Celebrations',
       'Acculturation - Spend Time with <Host Culture> Friends',
       'Acculturation - Spend Time with <Heritage Culture> Friends',
       'Acculturation - Participate in <Host Culture> Celebrations',
       'Acculturation - Participate in <Heritage Culture> Celebrations',
       'Acculturation - Perceived Host-Heritage Cultural Differences - Values',
       'Acculturation - Perceived Host-Heritage Cultural Differences - Mother Treatment',
       'Acculturation - Perceived Host-Heritage Cultural Differences - Teacher Treatment',
       'Calculator Use', 'Effort-real 1', 'Effort-real 2',
       'Difference in Effort', 'Student Questionnaire Form', 'Booklet ID',
       'Standard or simplified set of booklets', 'Age of student',
       'Grade compared to modal grade in country',
       'Unique national study programme code', 'Mathematics Anxiety',
       'Attitude towards School: Learning Outcomes',
       'Attitude towards School: Learning Activities',
       'Sense of Belonging to School', 'Father SQ ISEI', 'Mother SQ ISEI',
       "Mathematics Teacher's Classroom Management",
       'Country of Birth National Categories- Father',
       'Country of Birth National Categories- Mother',
       'Country of Birth National Categories- Self',
       'Cognitive Activation in Mathematics Lessons',
       'Cultural Distance between Host and Heritage Culture',
       'Cultural Possessions', 'Disciplinary Climate',
       'ICT Entertainment Use',
       'Index of economic, social and cultural status',
       'Experience with Applied Mathematics Tasks at School',
       'Experience with Pure Mathematics Tasks at School',
       'Attributions to Failure in Mathematics',
       'Familiarity with Mathematical Concepts',
       'Familiarity with Mathematical Concepts (Signal Detection  Adjusted)',
       'Family Structure', 'Educational level of father (ISCED)',
       'Home educational resources',
       'Acculturation: Heritage Culture Oriented  Strategies',
       'Highest educational level of parents',
       'Highest parental occupational status', 'Home Possessions',
       'ICT Use at Home for School-related Tasks',
       'Acculturation: Host Culture Oriented Strategies',
       'Attitudes Towards Computers: Limitations of the Computer as a Tool for School Learning',
       'Attitudes Towards Computers: Computer as a Tool for School Learning',
       'ICT Availability at Home', 'ICT resources',
       'ICT Availability at School', 'Immigration status',
       'Information about Careers',
       'Information about the Labour Market provided by the School',
       'Information about the Labour Market provided outside of School',
       'Instrumental Motivation for Mathematics', 'Mathematics Interest',
       'ISCED designation', 'ISCED level', 'ISCED orientation',
       'Preference for Heritage Language in Conversations with Family and Friends',
       'Language at home (3-digit code)',
       'Preference for Heritage Language in Language Reception and Production',
       'Learning time (minutes per week)  - <test language>',
       'Mathematics Behaviour', 'Mathematics Self-Efficacy',
       'Mathematics Intentions', 'Mathematics Work Ethic',
       'Educational level of mother (ISCED)',
       'Learning time (minutes per week)- <Mathematics>',
       "Mathematics Teacher's Support",
       'ISCO-08 Occupation code - Mother',
       'ISCO-08 Occupation code - Father', 'Openness for Problem Solving',
       'Out-of-School Study Time', 'Highest parental education in years',
       'Perseverance', 'Grade Repetition', 'Mathematics Self-Concept',
       'Learning time (minutes per week) - <Science>',
       'Teacher Student Relations', 'Subjective Norms in Mathematics',
       'Teacher Behaviour: Formative Assessment',
       'Teacher Behaviour: Student Orientation',
       'Teacher Behaviour: Teacher-directed Instruction',
       'Teacher Support', 'Language of the test',
       'Time of computer use (mins)', 'Use of ICT in Mathematic Lessons',
       'Use of ICT at School', 'Wealth',
       'Attitude towards School: Learning Outcomes (Anchored)',
       'Attitude towards School: Learning Activities (Anchored)',
       'Sense of Belonging to School (Anchored)',
       "Mathematics Teacher's Classroom Management (Anchored)",
       'Cognitive Activation in Mathematics Lessons (Anchored)',
       'Instrumental Motivation for Mathematics (Anchored)',
       'Mathematics Interest (Anchored)',
       'Mathematics Work Ethic (Anchored)',
       "Mathematics Teacher's Support (Anchored)",
       'Mathematics Self-Concept (Anchored)',
       'Teacher Student Relations (Anchored)',
       'Subjective Norms in Mathematics (Anchored)',
       'Plausible value 1 in mathematics',
       'Plausible value 2 in mathematics',
       'Plausible value 3 in mathematics',
       'Plausible value 4 in mathematics',
       'Plausible value 5 in mathematics',
       'Plausible value 1 in content subscale of math - Change and Relationships',
       'Plausible value 2 in content subscale of math - Change and Relationships',
       'Plausible value 3 in content subscale of math - Change and Relationships',
       'Plausible value 4 in content subscale of math - Change and Relationships',
       'Plausible value 5 in content subscale of math - Change and Relationships',
       'Plausible value 1 in content subscale of math - Quantity',
       'Plausible value 2 in content subscale of math - Quantity',
       'Plausible value 3 in content subscale of math - Quantity',
       'Plausible value 4 in content subscale of math - Quantity',
       'Plausible value 5 in content subscale of math - Quantity',
       'Plausible value 1 in content subscale of math - Space and Shape',
       'Plausible value 2 in content subscale of math - Space and Shape',
       'Plausible value 3 in content subscale of math - Space and Shape',
       'Plausible value 4 in content subscale of math - Space and Shape',
       'Plausible value 5 in content subscale of math - Space and Shape',
       'Plausible value 1 in content subscale of math - Uncertainty and Data',
       'Plausible value 2 in content subscale of math - Uncertainty and Data',
       'Plausible value 3 in content subscale of math - Uncertainty and Data',
       'Plausible value 4 in content subscale of math - Uncertainty and Data',
       'Plausible value 5 in content subscale of math - Uncertainty and Data',
       'Plausible value 1 in process subscale of math - Employ',
       'Plausible value 2 in process subscale of math - Employ',
       'Plausible value 3 in process subscale of math - Employ',
       'Plausible value 4 in process subscale of math - Employ',
       'Plausible value 5 in process subscale of math - Employ',
       'Plausible value 1 in process subscale of math - Formulate',
       'Plausible value 2 in process subscale of math - Formulate',
       'Plausible value 3 in process subscale of math - Formulate',
       'Plausible value 4 in process subscale of math - Formulate',
       'Plausible value 5 in process subscale of math - Formulate',
       'Plausible value 1 in process subscale of math - Interpret',
       'Plausible value 2 in process subscale of math - Interpret',
       'Plausible value 3 in process subscale of math - Interpret',
       'Plausible value 4 in process subscale of math - Interpret',
       'Plausible value 5 in process subscale of math - Interpret',
       'Plausible value 1 in reading', 'Plausible value 2 in reading',
       'Plausible value 3 in reading', 'Plausible value 4 in reading',
       'Plausible value 5 in reading', 'Plausible value 1 in science',
       'Plausible value 2 in science', 'Plausible value 3 in science',
       'Plausible value 4 in science', 'Plausible value 5 in science',
       'FINAL STUDENT WEIGHT', 'FINAL STUDENT REPLICATE BRR-FAY WEIGHT1',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT2',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT3',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT4',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT5',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT6',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT7',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT8',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT9',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT10',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT11',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT12',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT13',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT14',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT15',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT16',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT17',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT18',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT19',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT20',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT21',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT22',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT23',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT24',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT25',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT26',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT27',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT28',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT29',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT30',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT31',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT32',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT33',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT34',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT35',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT36',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT37',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT38',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT39',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT40',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT41',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT42',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT43',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT44',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT45',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT46',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT47',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT48',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT49',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT50',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT51',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT52',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT53',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT54',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT55',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT56',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT57',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT58',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT59',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT60',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT61',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT62',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT63',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT64',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT65',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT66',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT67',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT68',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT69',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT70',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT71',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT72',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT73',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT74',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT75',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT76',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT77',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT78',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT79',
       'FINAL STUDENT REPLICATE BRR-FAY WEIGHT80',
       'RANDOMIZED FINAL VARIANCE STRATUM (1-80)',
       'RANDOMLY ASSIGNED VARIANCE UNIT',
       'Senate weight - sum of weight within the country is 1000',
       'Date of the database creation'], dtype=object)

The main file: data of the PISA 2012 survey

In [9]:
# load the PISA 2012 survey data (spoiler: it takes a lot! of time on my laptop)
pisa_data = pd.read_csv('PISA_data/pisa2012.csv', encoding='latin-1', dtype='unicode')
pisa_data.head()
Out[9]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU
0 1 Albania 0080000 ALB0006 Non-OECD Albania 0000001 00001 10 1 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
1 2 Albania 0080000 ALB0006 Non-OECD Albania 0000001 00002 10 1 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
2 3 Albania 0080000 ALB0006 Non-OECD Albania 0000001 00003 9 1 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
3 4 Albania 0080000 ALB0006 Non-OECD Albania 0000001 00004 9 1 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
4 5 Albania 0080000 ALB0006 Non-OECD Albania 0000001 00005 9 1 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13

5 rows × 636 columns

In [10]:
pisa_data.shape
Out[10]:
(485490, 636)

What is the structure of your dataset?

The main dataset contains 485,490 rows, each representing one student, and 636 columns with coded names.

A second file provides the dictionary for the cryptic column names. There we see that each row of the main dataset

  • starts with 6 columns about the location of the school attended by the student (e.g. country code, subregion, school ID);
  • continues with the student ID column and a great deal of personal information about the student and their family, their economic situation and their experience in school (both emotional and with the school's teaching), and much more;
  • ends with their PLAUSIBLE results on the PISA tests (math and math subscales, science and reading; NO financial literacy) and the weight of the entry within its country, and
  • the date on which the entry was created.

What is/are the main feature(s) of interest in your dataset?

The description above is only a summary, and the dataset is truly fascinating for the possibilities of analysis it offers, even though it contains only the results of the main survey (the one used in all countries).

Sadly, here I need to cut down on the information I handle because of scarce resources (time and computer memory). Anyway, a detailed analysis is (luckily) not required.

As said, PISA 2012 focused on mathematics, so the dataset contains plausible scores for subsets of mathematical competencies.

Moreover, the countries that participated in the survey have different languages, but also different cultures and writing systems. I will select a subset of countries such that their writing system belongs to one of these groups:

  • alphabetic, with a highly phonemic orthography (or shallow orthography): Spanish, Finnish, Italian, German *
  • alphabetic, with a more complex relationship between orthography and pronunciation, a deeper orthography: English, French, Arabic *
  • logographic: Chinese, Japanese (and Korean, keeping in mind that Korean logograms, though still taught in school, have long been replaced in everyday use by the Korean alphabet) *

* I wanted to distinguish between languages "that are written the way they are pronounced", languages "where you cannot guess how a word is written just by hearing it", and languages that use ideograms... Wikipedia and the article Getting to the bottom of orthographic depth helped me put things down a little more precisely.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will definitely keep the mathematics and reading scores, as well as the language(s) spoken by the students and the language in which the test was administered (TESTLANG is the column heading).

Since the mathematics scores have subcategories, it will probably be interesting to look at them separately.

(More about the scores and proficiency levels in the next cells)

There is a wealth of information about the students' background as well, and it seems like a good idea to see if there are influential factors there, for instance the index of economic, social and cultural status (column "ESCS" in the dataset); but to really understand what information I have that should be used, I first need to trim the database and explore it better.

Proficiency levels

In the Technical Report (from p.296 on) it is possible to find the scales used to record the proficiency for the different areas:

  • Mathematics: Below 1, 1, 2, 3, 4, 5, 6 (highest level) *

  • Reading and Science: I couldn't find the scales in the document. It does, however, indicate that there is just one scale per discipline, the same as in 2009 and 2006 respectively.

    • Reading: 1a, 1b, 2, 3, 4, 5, 6 (highest level)
    • Science: 1, 2, 3, 4, 5, 6 (highest level)

All these proficiency levels come with a description of the competencies associated with them.

(Problem solving and financial literacy are also described, but the corresponding plausible scores are not in the dataset.)

*NB: this is the main scale for mathematics, and also for all the subscales of the different competencies tested (e.g. "formulating situations mathematically", "employing mathematical concepts, facts, ...").

However, the results are not reported on those scales!

We only have PLAUSIBLE VALUES. From the Technical Report:

Plausible values

As with all item response scaling models, student proficiencies (or measures) are not observed; they are missing data that must be inferred from the observed item responses. There are several possible alternative approaches for making this inference. PISA uses the imputation methodology usually referred to as plausible values (PVs). PVs are a selection of likely proficiencies for students that attained each score. For each scale and subscale, five plausible values per student are included in the international database. Using item parameters anchored at their estimated values from the international calibration, the plausible values are random draws from the marginal posterior of the latent distribution for each student.

About the plausible scales (p.156):

Sixty-five plausible values, five for each of the 13 PISA 2012 scales are included in the PISA 2012 database. PV1MATH to PV5MATH are for mathematical literacy; PV1SCIE to PV5SCIE for scientific literacy, PV1READ to PV5READ for reading literacy, PV1CPRO to PV5CPRO for computer problem solving assessment, PV1CMAT to PV5CMAT for the computer-based mathematics assessment and PV1CREA to PV5CREA for digital reading assessment. For the four mathematics content subscales, change and relationships, quantity, space and shape, uncertainty and data, the plausible values variables are PV1MACC to PV5MACC, PV1MACQ to PV5MACQ, PV1MACS to PV5MACS, and PV1MACU to PV5MACU respectively. For the three mathematics process subscales employ, formulate and interpret, the plausible values variables are PV1MAPE to PV5MAPE, PV1MAPF to PV5MAPF, and PV1MAPI to PV5MAPI respectively

My understanding about these evil PLAUSIBLE SCORES

Reading through the Technical report, I've come up with this layman explanation:

For every student, for every scale and subscale, the test results have been used to compute 5 plausible results the student could have reached if they had taken all the PISA tests, not only the subset contained in the booklet they tackled. This was done for accuracy in later estimating the parameters of the population (all the students of a country). That same accuracy is also the reason it is better to use the weight of each student and to repeat each calculation 5 times per scale (once per plausible-score column).
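The procedure above can be sketched with toy numbers: compute the statistic of interest once per plausible-value column, then average the five estimates. This is a minimal sketch with made-up scores; the PV1MATH..PV5MATH column names follow the data dictionary.

```python
import pandas as pd

# Toy data: three students, five plausible maths scores each.
pv_cols = [f'PV{i}MATH' for i in range(1, 6)]
df = pd.DataFrame(
    [[512.3, 498.7, 520.1, 505.4, 515.0],
     [430.2, 441.8, 425.5, 438.0, 433.1],
     [601.9, 595.2, 610.4, 598.8, 605.0]],
    columns=pv_cols)

# One estimate of the group mean per plausible value...
estimates = [df[col].mean() for col in pv_cols]
# ...then the final estimate is the average of the five.
group_mean = sum(estimates) / len(estimates)
```

Note the order of operations: the statistic is computed five times, once per PV column, rather than averaging the five PVs per student first (for a simple mean the result is the same, but for variances and percentiles it is not).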

The plausible score is given on a scale that can be converted to the corresponding proficiency level according to the bands provided in the Technical Report:

Main mathematical literacy levels and subscales as well - p. 297 of the Technical report

Level : Score points on the PISA scale

  • 6 : Above 669.3
  • 5 : From 607.0 to less than 669.3
  • 4 : From 544.7 to less than 607.0
  • 3 : From 482.4 to less than 544.7
  • 2 : From 420.1 to less than 482.4
  • 1 : From 357.8 to less than 420.1
  • Below level 1 : Below 357.8
Reading literacy performance band definitions on the PISA scale - p. 265 of the PISA 2009 Technical report

Level : Score points on the PISA scale

  • 6 : Higher than 698.32
  • 5 : Higher than 625.61 and less than or equal to 698.32
  • 4 : Higher than 552.89 and less than or equal to 625.61
  • 3 : Higher than 480.18 and less than or equal to 552.89
  • 2 : Higher than 407.47 and less than or equal to 480.18
  • 1a : Higher than 334.75 and less than or equal to 407.47
  • 1b : Higher than 262.04 and less than or equal to 334.75
Scientific literacy performance band definitions on the PISA scale - p. 293 of the PISA 2006 Technical report

Level : Score points on the PISA scale

  • 6 : Above 707.9
  • 5 : 633.3 to 707.9
  • 4 : 558.7 to 633.3
  • 3 : 484.1 to 558.7
  • 2 : 409.5 to 484.1
  • 1 : 334.9 to 409.5
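These bands translate directly into a lookup. A minimal sketch using the mathematics cut-points above with toy scores (pd.cut with left-closed intervals, since each band runs from its lower bound to less than the next one):

```python
import pandas as pd

# Maths cut-points from p. 297 of the Technical Report.
bins = [-float('inf'), 357.8, 420.1, 482.4, 544.7, 607.0, 669.3, float('inf')]
labels = ['Below 1', '1', '2', '3', '4', '5', '6']

scores = pd.Series([341.5, 433.0, 612.7, 670.0])
# right=False makes each interval [lower, upper): "from X to less than Y".
levels = pd.cut(scores, bins=bins, labels=labels, right=False)
```

The same pattern would work for the reading and science bands, with right=True instead, since those bands are stated as "higher than X and less than or equal to Y".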

About the weights

Every row/student has a weight (more than one, to be precise, but..).

The reason for it lies, if I understood correctly, in the sampling of subregions, schools and students done in every country, and the reason to use it (or them) is, in the end, to better represent the country-level parameters.

I'm planning to regroup the students by different parameters (primary first language = language at school = language of the PISA test), so they won't be divided by country and won't represent country-level proficiency.

Therefore I think it would be wrong to use the weights within this project.
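For reference only (since I won't use them here), applying the final student weight would amount to a weighted mean. A toy sketch, with W_FSTUWT as the weight column per the data dictionary:

```python
import numpy as np
import pandas as pd

# Toy data: three students with final student weights.
df = pd.DataFrame({'PV1MATH': [500.0, 450.0, 520.0],
                   'W_FSTUWT': [2.0, 1.0, 3.0]})

# Each score counts proportionally to its weight.
weighted_mean = np.average(df['PV1MATH'], weights=df['W_FSTUWT'])
```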

More Wrangling

Finish Gathering

As said, I will keep only the data for the countries speaking one or more of these languages: Spanish, Finnish, Italian, German, English, French, Arabic, Chinese, and Japanese.

The countries in the database are:

In [11]:
# find the list of countries contained
pisa_data.CNT.unique()
Out[11]:
array(['Albania', 'United Arab Emirates', 'Argentina', 'Australia',
       'Austria', 'Belgium', 'Bulgaria', 'Brazil', 'Canada',
       'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic',
       'Germany', 'Denmark', 'Spain', 'Estonia', 'Finland', 'France',
       'United Kingdom', 'Greece', 'Hong Kong-China', 'Croatia',
       'Hungary', 'Indonesia', 'Ireland', 'Iceland', 'Israel', 'Italy',
       'Jordan', 'Japan', 'Kazakhstan', 'Korea', 'Liechtenstein',
       'Lithuania', 'Luxembourg', 'Latvia', 'Macao-China', 'Mexico',
       'Montenegro', 'Malaysia', 'Netherlands', 'Norway', 'New Zealand',
       'Peru', 'Poland', 'Portugal', 'Qatar', 'China-Shanghai',
       'Perm(Russian Federation)', 'Florida (USA)', 'Connecticut (USA)',
       'Massachusetts (USA)', 'Romania', 'Russian Federation',
       'Singapore', 'Serbia', 'Slovak Republic', 'Slovenia', 'Sweden',
       'Chinese Taipei', 'Thailand', 'Tunisia', 'Turkey', 'Uruguay',
       'United States of America', 'Vietnam'], dtype=object)

By the way, these are the full names, not the 3-character country codes indicated by the data dictionary... which is good, because it makes it easier for me to check the spoken languages on Wikipedia for the countries I have doubts about.

I'm a bit puzzled by the presence of the United States of America and, separately, Florida, Connecticut and Massachusetts. Assuming that all the data in the dataset are valid (the survey data have been revised several times, if we believe the Technical Report, and we have to), the reason is probably to be found in the sampling choices and in schools volunteering to conduct the survey. Provided there are no duplicates, it doesn't worry me.
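The "no duplicates" assumption can at least be sanity-checked for exact row repetition with pandas' duplicated(). A toy sketch (note that SCHOOLID and STIDSTD restart within each country/economy, so a real overlap check between the national and state samples would need more than the ID columns alone):

```python
import pandas as pd

# Toy rows: IDs restart per country/economy, so the pair
# (SCHOOLID, STIDSTD) alone is not a dataset-wide key.
df = pd.DataFrame({'CNT': ['United States of America', 'Florida (USA)'],
                   'SCHOOLID': ['0000001', '0000001'],
                   'STIDSTD': ['00001', '00001']})

# Count exact full-row duplicates (none here: CNT differs).
n_dupes = df.duplicated().sum()
```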

I will keep these 37 Countries/economies: 'United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Belgium', 'Canada', 'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Germany', 'Spain', 'Finland', 'France', 'United Kingdom', 'Hong Kong-China', 'Ireland', 'Italy', 'Japan', 'Korea', 'Liechtenstein', 'Luxembourg', 'Macao-China', 'Mexico', 'New Zealand', 'Peru', 'Qatar', 'China-Shanghai', 'Florida (USA)', 'Connecticut (USA)', 'Massachusetts (USA)', 'Singapore', 'Chinese Taipei', 'Tunisia', 'Uruguay', 'United States of America', 'Vietnam'

In [12]:
# list of Countries to keep
to_keep = ['United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Belgium', 'Canada', 'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Germany', 'Spain', 'Finland', 'France', 'United Kingdom', 'Hong Kong-China', 'Ireland', 'Italy', 'Japan', 'Korea', 'Liechtenstein', 'Luxembourg', 'Macao-China', 'Mexico', 'New Zealand', 'Peru', 'Qatar', 'China-Shanghai', 'Florida (USA)', 'Connecticut (USA)', 'Massachusetts (USA)', 'Singapore', 'Chinese Taipei', 'Tunisia', 'Uruguay', 'United States of America', 'Vietnam']


# select only the desired Countries:
# reusing the df name in order not to use up too much memory
pisa_data = pisa_data.loc[pisa_data.CNT.isin(to_keep)]

# clean memory from trial results (answer "y" to prompt)
%reset Out
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Flushing output cache (6 entries)
In [13]:
pisa_data.shape
Out[13]:
(314831, 636)

We had 485,490 rows; now there are 314,831 (we dismissed roughly 35% of them).

Time to eliminate some of the columns. I'm not ready to dismiss most of them yet, even though in the end I will use only a few of the 636 in the dataset.

I'm quite confident in dismissing the following two chunks, though (I'm not bothering with picking out single columns):

  • from : EC04Q01A Acquired skills - Find job info - Yes, at school

    to: EC04Q06C Acquired skills - Student financing - No, never

  • from: W_FSTR1 FINAL STUDENT REPLICATE BRR-FAY WEIGHT1

    to: VAR_UNIT RANDOMLY ASSIGNED VARIANCE UNIT

In [14]:
# delete the columns listed above
pisa_data.drop(pisa_data.loc[:, 'EC04Q01A':'EC04Q06C'].columns, axis=1, inplace=True)
pisa_data.drop(pisa_data.loc[:, 'W_FSTR1':'VAR_UNIT'].columns, axis=1, inplace=True)
# since I'm doing this, I'll drop also the first, unnamed, column
pisa_data.drop(pisa_data.iloc[:,0:1].columns, axis=1, inplace=True)
In [15]:
# looking a bit better?
pisa_data.shape
Out[15]:
(314831, 535)

From 636 columns to 535: probably still 510-520 useless columns, but 101 columns better.

I'll save my subset of the database as it is now, so that, if needed, I can resume running the cells from the last one before the next section, "Select variables".

In [16]:
pisa_data.dtypes
Out[16]:
CNT           object
SUBNATIO      object
STRATUM       object
OECD          object
NC            object
               ...  
PV4SCIE       object
PV5SCIE       object
W_FSTUWT      object
SENWGT_STU    object
VER_STU       object
Length: 535, dtype: object
In [17]:
# save the dataset, keep ',' as separator, keep 'latin-1' as encoding
pisa_data.to_csv('PISA_data/selected_countries_subset.csv', index=False, encoding='latin-1')
print('Saved')
Saved
In [18]:
# clean a bit of memory
pisa_data = ''
%reset Out
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Flushing output cache (3 entries)
In [19]:
# in case I need to restart the kernel or the computer
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# reload the dictionary
pisa_variables = pd.read_csv('PISA_data/pisadict2012.csv', encoding='latin-1', dtype='unicode')
pisa_variables.columns = ['code', 'x']


# load my selected Countries data
pisa_subset = pd.read_csv('PISA_data/selected_countries_subset.csv', encoding='latin-1', dtype='unicode')
# have a look
pisa_subset.head()
Out[19]:
CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ST03Q01 ... PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE W_FSTUWT SENWGT_STU VER_STU
0 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00001 10 1 08 ... 310.3345 241.3672 397.2771 387.9523 375.8299 365.5725 351.5852 2.8838 0.071 22NOV13
1 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00002 10 1 06 ... 198.2226 243.1315 301.5106 267.0086 315.4979 289.3883 305.2406 2.8838 0.071 22NOV13
2 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00003 10 1 09 ... 251.0707 322.4439 293.771 278.8512 200.5222 234.0918 327.3405 2.8838 0.071 22NOV13
3 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00004 9 1 12 ... 214.1813 252.6746 329.2055 298.4334 281.6486 319.8806 318.9482 2.8838 0.071 22NOV13
4 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00005 10 1 03 ... 301.6735 249.547 378.1611 405.2033 389.351 408.0007 410.7982 2.8838 0.071 22NOV13

5 rows × 535 columns

Select variables

In [20]:
pisa_subset.shape
Out[20]:
(314831, 535)
In [21]:
pisa_subset.CNT.count()
Out[21]:
314831

pisa_variables that look interesting:

  • 'Country code 3-character',
  • 'OECD country',
  • 'Student ID', NB: it starts from 1 for every country, so it needs to be combined with the Country variable to be unique
  • 'Gender',
  • 'Mother<Highest Schooling>',
  • 'Mother Current Job Status',
  • 'Father<Highest Schooling>',
  • 'Father Current Job Status',
  • 'Highest educational level of parents',
  • 'Country of Birth International - Self',
  • 'First language learned',
  • 'Age started learning <test language>',
  • 'Language of the test',
  • 'Language spoken - Mother',
  • 'Language spoken - Father',
  • 'Standard or simplified set of booklets',
  • 'Attitude towards School: Learning Outcomes',
  • 'Attitude towards School: Learning Activities',
  • 'Sense of Belonging to School',
  • "Mathematics Teacher's Classroom Management",
  • 'Cognitive Activation in Mathematics Lessons',
  • 'Instrumental Motivation for Mathematics',
  • 'Mathematics Interest',
  • 'Mathematics Work Ethic',
  • "Mathematics Teacher's Support",
  • 'Mathematics Self-Concept',
  • 'Teacher Student Relations',
  • 'Subjective Norms in Mathematics',
  • 'Index of economic, social and cultural status',
  • 'Home educational resources',
  • 'Preference for Heritage Language in Conversations with Family and Friends',
  • 'Language at home (3-digit code)',
  • 'International Language at Home',
  • 'Preference for Heritage Language in Language Reception and Production',
  • 'Plausible value 1 in mathematics',
  • 'Plausible value 2 in mathematics',
  • 'Plausible value 3 in mathematics',
  • 'Plausible value 4 in mathematics',
  • 'Plausible value 5 in mathematics',
  • 'Plausible value 1 in content subscale of math - Change and Relationships',
  • 'Plausible value 2 in content subscale of math - Change and Relationships',
  • 'Plausible value 3 in content subscale of math - Change and Relationships',
  • 'Plausible value 4 in content subscale of math - Change and Relationships',
  • 'Plausible value 5 in content subscale of math - Change and Relationships',
  • 'Plausible value 1 in content subscale of math - Quantity',
  • 'Plausible value 2 in content subscale of math - Quantity',
  • 'Plausible value 3 in content subscale of math - Quantity',
  • 'Plausible value 4 in content subscale of math - Quantity',
  • 'Plausible value 5 in content subscale of math - Quantity',
  • 'Plausible value 1 in content subscale of math - Space and Shape',
  • 'Plausible value 2 in content subscale of math - Space and Shape',
  • 'Plausible value 3 in content subscale of math - Space and Shape',
  • 'Plausible value 4 in content subscale of math - Space and Shape',
  • 'Plausible value 5 in content subscale of math - Space and Shape',
  • 'Plausible value 1 in content subscale of math - Uncertainty and Data',
  • 'Plausible value 2 in content subscale of math - Uncertainty and Data',
  • 'Plausible value 3 in content subscale of math - Uncertainty and Data',
  • 'Plausible value 4 in content subscale of math - Uncertainty and Data',
  • 'Plausible value 5 in content subscale of math - Uncertainty and Data',
  • 'Plausible value 1 in process subscale of math - Employ',
  • 'Plausible value 2 in process subscale of math - Employ',
  • 'Plausible value 3 in process subscale of math - Employ',
  • 'Plausible value 4 in process subscale of math - Employ',
  • 'Plausible value 5 in process subscale of math - Employ',
  • 'Plausible value 1 in process subscale of math - Formulate',
  • 'Plausible value 2 in process subscale of math - Formulate',
  • 'Plausible value 3 in process subscale of math - Formulate',
  • 'Plausible value 4 in process subscale of math - Formulate',
  • 'Plausible value 5 in process subscale of math - Formulate',
  • 'Plausible value 1 in process subscale of math - Interpret',
  • 'Plausible value 2 in process subscale of math - Interpret',
  • 'Plausible value 3 in process subscale of math - Interpret',
  • 'Plausible value 4 in process subscale of math - Interpret',
  • 'Plausible value 5 in process subscale of math - Interpret',
  • 'Plausible value 1 in reading',
  • 'Plausible value 2 in reading',
  • 'Plausible value 3 in reading',
  • 'Plausible value 4 in reading',
  • 'Plausible value 5 in reading',
  • 'Plausible value 1 in science',
  • 'Plausible value 2 in science',
  • 'Plausible value 3 in science',
  • 'Plausible value 4 in science',
  • 'Plausible value 5 in science'

There would be more, but even these are already more than I will use.
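Since the student ID restarts from 1 in every country, a dataset-wide unique key has to combine it with the country column. A toy sketch (the 'student_key' name is mine):

```python
import pandas as pd

# Toy rows: the same STIDSTD appears in two countries.
df = pd.DataFrame({'CNT': ['Italy', 'Italy', 'Japan'],
                   'STIDSTD': ['00001', '00002', '00001']})

# Concatenating country and student ID gives a unique key.
df['student_key'] = df['CNT'] + '_' + df['STIDSTD']
```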

In [22]:
descriptions = ['Country code 3-character',
                'OECD country',
                'Student ID',
                'Gender',
                'Mother<Highest Schooling>',
                'Mother Current Job Status',
                'Father<Highest Schooling>',
                'Father Current Job Status',
                'Highest educational level of parents',
                'Country of Birth International - Self',
                'First language learned',
                'Age started learning <test language>',
                'Language of the test',
                'Language spoken - Mother',
                'Language spoken - Father',
                'Standard or simplified set of booklets',
                'Attitude towards School: Learning Outcomes',
                'Attitude towards School: Learning Activities',
                'Sense of Belonging to School',
                "Mathematics Teacher's Classroom Management",
                'Cognitive Activation in Mathematics Lessons',
                'Instrumental Motivation for Mathematics',
                'Mathematics Interest',
                'Mathematics Work Ethic',
                "Mathematics Teacher's Support",
                'Mathematics Self-Concept',
                'Teacher Student Relations',
                'Subjective Norms in Mathematics',
                'Index of economic, social and cultural status',
                'Home educational resources',
                'Preference for Heritage Language in Conversations with Family and Friends',
                'Language at home (3-digit code)',
                'International Language at Home',
                'Preference for Heritage Language in Language Reception and Production',
                'Plausible value 1 in mathematics',
                'Plausible value 2 in mathematics',
                'Plausible value 3 in mathematics',
                'Plausible value 4 in mathematics',
                'Plausible value 5 in mathematics',
                'Plausible value 1 in content subscale of math - Change and Relationships',
                'Plausible value 2 in content subscale of math - Change and Relationships',
                'Plausible value 3 in content subscale of math - Change and Relationships',
                'Plausible value 4 in content subscale of math - Change and Relationships',
                'Plausible value 5 in content subscale of math - Change and Relationships',
                'Plausible value 1 in content subscale of math - Quantity',
                'Plausible value 2 in content subscale of math - Quantity',
                'Plausible value 3 in content subscale of math - Quantity',
                'Plausible value 4 in content subscale of math - Quantity',
                'Plausible value 5 in content subscale of math - Quantity',
                'Plausible value 1 in content subscale of math - Space and Shape',
                'Plausible value 2 in content subscale of math - Space and Shape',
                'Plausible value 3 in content subscale of math - Space and Shape',
                'Plausible value 4 in content subscale of math - Space and Shape',
                'Plausible value 5 in content subscale of math - Space and Shape',
                'Plausible value 1 in content subscale of math - Uncertainty and Data',
                'Plausible value 2 in content subscale of math - Uncertainty and Data',
                'Plausible value 3 in content subscale of math - Uncertainty and Data',
                'Plausible value 4 in content subscale of math - Uncertainty and Data',
                'Plausible value 5 in content subscale of math - Uncertainty and Data',
                'Plausible value 1 in process subscale of math - Employ',
                'Plausible value 2 in process subscale of math - Employ',
                'Plausible value 3 in process subscale of math - Employ',
                'Plausible value 4 in process subscale of math - Employ',
                'Plausible value 5 in process subscale of math - Employ',
                'Plausible value 1 in process subscale of math - Formulate',
                'Plausible value 2 in process subscale of math - Formulate',
                'Plausible value 3 in process subscale of math - Formulate',
                'Plausible value 4 in process subscale of math - Formulate',
                'Plausible value 5 in process subscale of math - Formulate',
                'Plausible value 1 in process subscale of math - Interpret',
                'Plausible value 2 in process subscale of math - Interpret',
                'Plausible value 3 in process subscale of math - Interpret',
                'Plausible value 4 in process subscale of math - Interpret',
                'Plausible value 5 in process subscale of math - Interpret',
                'Plausible value 1 in reading',
                'Plausible value 2 in reading',
                'Plausible value 3 in reading',
                'Plausible value 4 in reading',
                'Plausible value 5 in reading',
                'Plausible value 1 in science',
                'Plausible value 2 in science',
                'Plausible value 3 in science',
                'Plausible value 4 in science',
                'Plausible value 5 in science']
In [23]:
# get the codes of the chosen columns
columns_to_keep = pisa_variables[pisa_variables['x'].isin(descriptions)]['code'].tolist()
In [24]:
pisa_chosen_var = pisa_subset[columns_to_keep]
In [25]:
pisa_chosen_var
Out[25]:
CNT OECD STIDSTD ST04Q01 ST13Q01 ST15Q01 ST17Q01 ST19Q01 ST20Q01 ST25Q01 ... PV1READ PV2READ PV3READ PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE
0 United Arab Emirates Non-OECD 00001 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Working full-time <for pay> Other country Language of the test ... 313.5423 309.5325 267.0294 310.3345 241.3672 397.2771 387.9523 375.8299 365.5725 351.5852
1 United Arab Emirates Non-OECD 00002 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Not working, but looking for a job Country of test Language of the test ... 269.5957 146.0961 261.5762 198.2226 243.1315 301.5106 267.0086 315.4979 289.3883 305.2406
2 United Arab Emirates Non-OECD 00003 Male NaN Other (e.g. home duties, retired) <ISCED level 2> Working part-time <for pay> Country of test Language of the test ... 267.9116 257.4863 226.2104 251.0707 322.4439 293.771 278.8512 200.5222 234.0918 327.3405
3 United Arab Emirates Non-OECD 00004 Male <ISCED level 3A> Other (e.g. home duties, retired) NaN Working full-time <for pay> Other country Language of the test ... 217.389 178.8957 183.7074 214.1813 252.6746 329.2055 298.4334 281.6486 319.8806 318.9482
4 United Arab Emirates Non-OECD 00005 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Other (e.g. home duties, retired) Country of test Language of the test ... 279.219 303.2774 285.6346 301.6735 249.547 378.1611 405.2033 389.351 408.0007 410.7982
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
314826 Vietnam Non-OECD 04955 Female <ISCED level 2> Other (e.g. home duties, retired) He did not complete <ISCED level 1> Other (e.g. home duties, retired) Country of test Language of the test ... 460.2272 476.1134 472.9362 472.1419 481.6736 559.8098 528.1052 519.7128 535.5651 538.3626
314827 Vietnam Non-OECD 04956 Male She did not complete <ISCED level 1> Other (e.g. home duties, retired) <ISCED level 2> Other (e.g. home duties, retired) Country of test Language of the test ... 490.9325 479.7053 448.4294 565.5134 451.6372 538.7355 493.9761 493.0436 561.1153 535.0056
314828 Vietnam Non-OECD 04957 Male <ISCED level 1> Working full-time <for pay> <ISCED level 2> Working full-time <for pay> Country of test Language of the test ... 462.6239 514.7503 434.5558 457.8122 511.5425 536.8706 571.3726 488.3812 548.9929 563.9127
314829 Vietnam Non-OECD 04958 Male <ISCED level 2> Other (e.g. home duties, retired) <ISCED level 3A> Other (e.g. home duties, retired) Country of test Language of the test ... 505.2873 522.1282 513.3068 528.5437 522.9301 511.0407 532.4879 524.0955 551.1376 514.7706
314830 Vietnam Non-OECD 04959 Female She did not complete <ISCED level 1> Other (e.g. home duties, retired) <ISCED level 3A> Working full-time <for pay> Country of test Language of the test ... 532.3506 483.1034 479.9261 459.2741 488.6635 530.6229 473.7411 477.4711 477.4711 505.4457

314831 rows × 84 columns

Assess, Clean and Test

In [26]:
pisa_chosen_var.shape
Out[26]:
(314831, 84)
In [27]:
pisa_chosen_var.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 84 columns):
CNT         314831 non-null object
OECD        314831 non-null object
STIDSTD     314831 non-null object
ST04Q01     314831 non-null object
ST13Q01     295559 non-null object
ST15Q01     303438 non-null object
ST17Q01     285911 non-null object
ST19Q01     293592 non-null object
ST20Q01     308622 non-null object
ST25Q01     300461 non-null object
EC05Q01     95301 non-null object
EC06Q01     30918 non-null object
EC07Q01     30299 non-null object
EC07Q02     29819 non-null object
EASY        314831 non-null object
ATSCHL      202601 non-null object
ATTLNACT    202093 non-null object
BELONG      203134 non-null object
CLSMAN      202822 non-null object
COGACT      204216 non-null object
ESCS        309909 non-null object
HEDRES      309882 non-null object
HISCED      305922 non-null object
INSTMOT     205531 non-null object
INTMAT      205712 non-null object
LANGCOMM    30540 non-null object
LANGN       312263 non-null object
LANGRPPD    29791 non-null object
MATWKETH    204331 non-null object
MTSUP       203436 non-null object
SCMAT       204183 non-null object
STUDREL     203429 non-null object
SUBNORM     205532 non-null object
TESTLANG    314218 non-null object
PV1MATH     314831 non-null object
PV2MATH     314831 non-null object
PV3MATH     314831 non-null object
PV4MATH     314831 non-null object
PV5MATH     314831 non-null object
PV1MACC     309853 non-null object
PV2MACC     309853 non-null object
PV3MACC     309853 non-null object
PV4MACC     309853 non-null object
PV5MACC     309853 non-null object
PV1MACQ     309853 non-null object
PV2MACQ     309853 non-null object
PV3MACQ     309853 non-null object
PV4MACQ     309853 non-null object
PV5MACQ     309853 non-null object
PV1MACS     309853 non-null object
PV2MACS     309853 non-null object
PV3MACS     309853 non-null object
PV4MACS     309853 non-null object
PV5MACS     309853 non-null object
PV1MACU     309853 non-null object
PV2MACU     309853 non-null object
PV3MACU     309853 non-null object
PV4MACU     309853 non-null object
PV5MACU     309853 non-null object
PV1MAPE     300780 non-null object
PV2MAPE     300780 non-null object
PV3MAPE     300780 non-null object
PV4MAPE     300780 non-null object
PV5MAPE     300780 non-null object
PV1MAPF     300780 non-null object
PV2MAPF     300780 non-null object
PV3MAPF     300780 non-null object
PV4MAPF     300780 non-null object
PV5MAPF     300780 non-null object
PV1MAPI     300780 non-null object
PV2MAPI     300780 non-null object
PV3MAPI     300780 non-null object
PV4MAPI     300780 non-null object
PV5MAPI     300780 non-null object
PV1READ     314831 non-null object
PV2READ     314831 non-null object
PV3READ     314831 non-null object
PV4READ     314831 non-null object
PV5READ     314831 non-null object
PV1SCIE     314831 non-null object
PV2SCIE     314831 non-null object
PV3SCIE     314831 non-null object
PV4SCIE     314831 non-null object
PV5SCIE     314831 non-null object
dtypes: object(84)
memory usage: 201.8+ MB

ISSUES:

Missing data: a lot

These values cannot be retrieved elsewhere, so after a bit of cleaning I will assess the columns I want to use and decide what to do.

Tidiness

The dataset looks OK. However, I still have more variables than I will use; I will trim them later, after some exploration.

Quality
  • column names need to be more informative
  • all variables are stored as strings, which is not correct: some should be strings, others int or float, and others categories. It will be easier to decide which ones to convert after changing the column names.
  • plausible values need to be converted into plausible levels (the conversion bands are reported in the next cell)

I'll fix these quality issues and then look for more

Test score conversion bands

Mathematical literacy performance band definitions on the PISA scale (main scale and subscales)

Level : Score points on the PISA scale

  • 6 : value >= 669.3
  • 5 : 607.0 <= value < 669.3
  • 4 : 544.7 <= value < 607.0
  • 3 : 482.4 <= value < 544.7
  • 2 : 420.1 <= value < 482.4
  • 1 : 357.8 <= value < 420.1
  • Below 1 : value < 357.8
Reading literacy performance band definitions on the PISA scale

Level : Score points on the PISA scale

  • 6 : value > 698.32
  • 5 : 625.61 < value <= 698.32
  • 4 : 552.89 < value <= 625.61
  • 3 : 480.18 < value <= 552.89
  • 2 : 407.47 < value <= 480.18
  • 1a : 334.75 < value <= 407.47
  • 1b : 262.04 < value <= 334.75
Scientific literacy performance band definitions on the PISA scale

Level : Score points on the PISA scale

  • 6 : value > 707.9
  • 5 : 633.3 < value <= 707.9
  • 4 : 558.7 < value <= 633.3
  • 3 : 484.1 < value <= 558.7
  • 2 : 409.5 < value <= 484.1
  • 1 : 334.9 < value <= 409.5 (the source did not specify which bounds are inclusive for this scale, so I aligned it with the reading scale)
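All three band definitions above can also be expressed as a single binning step with pandas' `pd.cut`, instead of a hand-written per-value function. A minimal sketch for the mathematics scale (lower bounds inclusive, as specified above; an alternative I did not use below):

```python
import numpy as np
import pandas as pd

# band edges for the mathematics scale; right=False makes the lower
# bound of each band inclusive, matching the table above
math_bins = [-np.inf, 357.8, 420.1, 482.4, 544.7, 607.0, 669.3, np.inf]
math_labels = ['Below 1', '1', '2', '3', '4', '5', '6']

scores = pd.Series([250.0, 400.0, 610.5, np.nan, 700.0])
levels = pd.cut(scores, bins=math_bins, labels=math_labels, right=False)
print(levels.tolist())  # NaN scores stay NaN
```

`pd.cut` returns an ordered categorical directly, so the separate `CategoricalDtype` step would not be needed.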

Make a copy of the dataset

In [28]:
# make a copy
df = pisa_chosen_var.copy() # sorry, but df is a convenient name
print('Done!')
Done!

I'll deal first with the plausible value columns (both the conversion of scores to levels and of levels to ordered categories)

The column names (from Excel) are:

mathematics

'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1MACC', 'PV2MACC', 'PV3MACC', 'PV4MACC', 'PV5MACC', 'PV1MACQ', 'PV2MACQ', 'PV3MACQ', 'PV4MACQ', 'PV5MACQ', 'PV1MACS', 'PV2MACS', 'PV3MACS', 'PV4MACS', 'PV5MACS', 'PV1MACU', 'PV2MACU', 'PV3MACU', 'PV4MACU', 'PV5MACU', 'PV1MAPE', 'PV2MAPE', 'PV3MAPE', 'PV4MAPE', 'PV5MAPE', 'PV1MAPF', 'PV2MAPF', 'PV3MAPF', 'PV4MAPF', 'PV5MAPF', 'PV1MAPI', 'PV2MAPI', 'PV3MAPI', 'PV4MAPI', 'PV5MAPI'

reading

'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ'

science

'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'

In [29]:
# collect the column names:
mathematics = ['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH',
               'PV1MACC', 'PV2MACC', 'PV3MACC', 'PV4MACC', 'PV5MACC',
               'PV1MACQ', 'PV2MACQ', 'PV3MACQ', 'PV4MACQ', 'PV5MACQ',
               'PV1MACS', 'PV2MACS', 'PV3MACS', 'PV4MACS', 'PV5MACS',
               'PV1MACU', 'PV2MACU', 'PV3MACU', 'PV4MACU', 'PV5MACU',
               'PV1MAPE', 'PV2MAPE', 'PV3MAPE', 'PV4MAPE', 'PV5MAPE',
               'PV1MAPF', 'PV2MAPF', 'PV3MAPF', 'PV4MAPF', 'PV5MAPF',
               'PV1MAPI', 'PV2MAPI', 'PV3MAPI', 'PV4MAPI', 'PV5MAPI']

reading = ['PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ']

science = ['PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']
In [30]:
# convert the columns from strings to float
all_PV = mathematics + reading + science

for col in all_PV:
    df[col] = df[col].astype('float')
    
print('Done!')
Done!
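As an aside, `astype('float')` raises on any string it cannot parse; if malformed entries were a concern, `pd.to_numeric` with `errors='coerce'` would turn them into NaN instead. A sketch on toy data:

```python
import pandas as pd

s = pd.Series(['313.5423', 'n/a', '241.3672'])
converted = pd.to_numeric(s, errors='coerce')  # 'n/a' becomes NaN
print(converted.tolist())
```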
In [31]:
# verify dtype
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 84 columns):
CNT         314831 non-null object
OECD        314831 non-null object
STIDSTD     314831 non-null object
ST04Q01     314831 non-null object
ST13Q01     295559 non-null object
ST15Q01     303438 non-null object
ST17Q01     285911 non-null object
ST19Q01     293592 non-null object
ST20Q01     308622 non-null object
ST25Q01     300461 non-null object
EC05Q01     95301 non-null object
EC06Q01     30918 non-null object
EC07Q01     30299 non-null object
EC07Q02     29819 non-null object
EASY        314831 non-null object
ATSCHL      202601 non-null object
ATTLNACT    202093 non-null object
BELONG      203134 non-null object
CLSMAN      202822 non-null object
COGACT      204216 non-null object
ESCS        309909 non-null object
HEDRES      309882 non-null object
HISCED      305922 non-null object
INSTMOT     205531 non-null object
INTMAT      205712 non-null object
LANGCOMM    30540 non-null object
LANGN       312263 non-null object
LANGRPPD    29791 non-null object
MATWKETH    204331 non-null object
MTSUP       203436 non-null object
SCMAT       204183 non-null object
STUDREL     203429 non-null object
SUBNORM     205532 non-null object
TESTLANG    314218 non-null object
PV1MATH     314831 non-null float64
PV2MATH     314831 non-null float64
PV3MATH     314831 non-null float64
PV4MATH     314831 non-null float64
PV5MATH     314831 non-null float64
PV1MACC     309853 non-null float64
PV2MACC     309853 non-null float64
PV3MACC     309853 non-null float64
PV4MACC     309853 non-null float64
PV5MACC     309853 non-null float64
PV1MACQ     309853 non-null float64
PV2MACQ     309853 non-null float64
PV3MACQ     309853 non-null float64
PV4MACQ     309853 non-null float64
PV5MACQ     309853 non-null float64
PV1MACS     309853 non-null float64
PV2MACS     309853 non-null float64
PV3MACS     309853 non-null float64
PV4MACS     309853 non-null float64
PV5MACS     309853 non-null float64
PV1MACU     309853 non-null float64
PV2MACU     309853 non-null float64
PV3MACU     309853 non-null float64
PV4MACU     309853 non-null float64
PV5MACU     309853 non-null float64
PV1MAPE     300780 non-null float64
PV2MAPE     300780 non-null float64
PV3MAPE     300780 non-null float64
PV4MAPE     300780 non-null float64
PV5MAPE     300780 non-null float64
PV1MAPF     300780 non-null float64
PV2MAPF     300780 non-null float64
PV3MAPF     300780 non-null float64
PV4MAPF     300780 non-null float64
PV5MAPF     300780 non-null float64
PV1MAPI     300780 non-null float64
PV2MAPI     300780 non-null float64
PV3MAPI     300780 non-null float64
PV4MAPI     300780 non-null float64
PV5MAPI     300780 non-null float64
PV1READ     314831 non-null float64
PV2READ     314831 non-null float64
PV3READ     314831 non-null float64
PV4READ     314831 non-null float64
PV5READ     314831 non-null float64
PV1SCIE     314831 non-null float64
PV2SCIE     314831 non-null float64
PV3SCIE     314831 non-null float64
PV4SCIE     314831 non-null float64
PV5SCIE     314831 non-null float64
dtypes: float64(50), object(34)
memory usage: 201.8+ MB
In [32]:
# convert the values for mathematics and turn into categories

# define conversion function
def conversion_math(x):
    if pd.isna(x):  # x == np.nan is always False; NaN must be tested with pd.isna
        return x
    if x >= 669.3:
        return '6'
    if 607.0 <= x < 669.3:
        return '5'
    if 544.7 <= x < 607.0:
        return '4'
    if 482.4 <= x < 544.7:
        return '3'
    if 420.1 <= x < 482.4:
        return '2'
    if 357.8 <= x < 420.1:
        return '1'
    return 'Below 1'  # x < 357.8

# create the categories
values_math = ['Below 1', '1', '2', '3', '4', '5', '6']
ordered_math = pd.api.types.CategoricalDtype(ordered=True, categories=values_math)

print('Working on this..')

# apply
for col in mathematics:
    df[col] = df[col].apply(conversion_math)
    df[col] = df[col].astype(ordered_math)

print('Done!')
Working on this..
Done!
In [33]:
# convert the values for reading and turn into categories

# define conversion function
def conversion_reading(x):
    if pd.isna(x):  # x == np.nan is always False; NaN must be tested with pd.isna
        return x
    if x > 698.32:
        return '6'
    if 625.61 < x <= 698.32:
        return '5'
    if 552.89 < x <= 625.61:
        return '4'
    if 480.18 < x <= 552.89:
        return '3'
    if 407.47 < x <= 480.18:
        return '2'
    if 334.75 < x <= 407.47:
        return '1a'
    if 262.04 < x <= 334.75:
        return '1b'
    return 'Too low'  # my addition (x <= 262.04), to not leave values uncategorized

    
# create the categories
values_reading = ['Too low', '1b', '1a', '2', '3', '4', '5', '6']
ordered_reading = pd.api.types.CategoricalDtype(ordered=True, categories=values_reading)

print('Working on this..')

# apply
for col in reading:
    df[col] = df[col].apply(conversion_reading)
    df[col] = df[col].astype(ordered_reading)

print('Done!')
Working on this..
Done!
In [34]:
# convert the values for science and turn into categories

# define conversion function
def conversion_science(x):
    if pd.isna(x):  # x == np.nan is always False; NaN must be tested with pd.isna
        return x
    if x > 707.9:
        return '6'
    if 633.3 < x <= 707.9:
        return '5'
    if 558.7 < x <= 633.3:
        return '4'
    if 484.1 < x <= 558.7:
        return '3'
    if 409.5 < x <= 484.1:
        return '2'
    if 334.9 < x <= 409.5:
        return '1'
    return 'Too low'  # my addition (x <= 334.9), to not leave values uncategorized
    
# create the categories
values_science = ['Too low', '1', '2', '3', '4', '5', '6']
ordered_science = pd.api.types.CategoricalDtype(ordered=True, categories=values_science)

print('Working on this..')

# apply
for col in science:
    df[col] = df[col].apply(conversion_science)
    df[col] = df[col].astype(ordered_science)
    
print('Done!')
Working on this..
Done!
In [35]:
df[['PV3MATH', 'PV1READ', 'PV4SCIE']]
Out[35]:
PV3MATH PV1READ PV4SCIE
0 Below 1 1b 1
1 Below 1 1b Too low
2 Below 1 1b Too low
3 Below 1 Too low Too low
4 Below 1 1b 1
... ... ... ...
314826 2 2 3
314827 3 3 4
314828 3 2 3
314829 3 3 3
314830 2 3 2

314831 rows × 3 columns

In [36]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 84 columns):
CNT         314831 non-null object
OECD        314831 non-null object
STIDSTD     314831 non-null object
ST04Q01     314831 non-null object
ST13Q01     295559 non-null object
ST15Q01     303438 non-null object
ST17Q01     285911 non-null object
ST19Q01     293592 non-null object
ST20Q01     308622 non-null object
ST25Q01     300461 non-null object
EC05Q01     95301 non-null object
EC06Q01     30918 non-null object
EC07Q01     30299 non-null object
EC07Q02     29819 non-null object
EASY        314831 non-null object
ATSCHL      202601 non-null object
ATTLNACT    202093 non-null object
BELONG      203134 non-null object
CLSMAN      202822 non-null object
COGACT      204216 non-null object
ESCS        309909 non-null object
HEDRES      309882 non-null object
HISCED      305922 non-null object
INSTMOT     205531 non-null object
INTMAT      205712 non-null object
LANGCOMM    30540 non-null object
LANGN       312263 non-null object
LANGRPPD    29791 non-null object
MATWKETH    204331 non-null object
MTSUP       203436 non-null object
SCMAT       204183 non-null object
STUDREL     203429 non-null object
SUBNORM     205532 non-null object
TESTLANG    314218 non-null object
PV1MATH     314831 non-null category
PV2MATH     314831 non-null category
PV3MATH     314831 non-null category
PV4MATH     314831 non-null category
PV5MATH     314831 non-null category
PV1MACC     309853 non-null category
PV2MACC     309853 non-null category
PV3MACC     309853 non-null category
PV4MACC     309853 non-null category
PV5MACC     309853 non-null category
PV1MACQ     309853 non-null category
PV2MACQ     309853 non-null category
PV3MACQ     309853 non-null category
PV4MACQ     309853 non-null category
PV5MACQ     309853 non-null category
PV1MACS     309853 non-null category
PV2MACS     309853 non-null category
PV3MACS     309853 non-null category
PV4MACS     309853 non-null category
PV5MACS     309853 non-null category
PV1MACU     309853 non-null category
PV2MACU     309853 non-null category
PV3MACU     309853 non-null category
PV4MACU     309853 non-null category
PV5MACU     309853 non-null category
PV1MAPE     300780 non-null category
PV2MAPE     300780 non-null category
PV3MAPE     300780 non-null category
PV4MAPE     300780 non-null category
PV5MAPE     300780 non-null category
PV1MAPF     300780 non-null category
PV2MAPF     300780 non-null category
PV3MAPF     300780 non-null category
PV4MAPF     300780 non-null category
PV5MAPF     300780 non-null category
PV1MAPI     300780 non-null category
PV2MAPI     300780 non-null category
PV3MAPI     300780 non-null category
PV4MAPI     300780 non-null category
PV5MAPI     300780 non-null category
PV1READ     314831 non-null category
PV2READ     314831 non-null category
PV3READ     314831 non-null category
PV4READ     314831 non-null category
PV5READ     314831 non-null category
PV1SCIE     314831 non-null category
PV2SCIE     314831 non-null category
PV3SCIE     314831 non-null category
PV4SCIE     314831 non-null category
PV5SCIE     314831 non-null category
dtypes: category(50), object(34)
memory usage: 96.7+ MB

Column names

In [37]:
# all my chosen column descriptions are in the variable descriptions; the codes are in the variable columns_to_keep
# unfortunately I cannot zip them directly, because they are not in the same order

# create a subset of pisa_variables
my_columns = pisa_variables[pisa_variables['code'].isin(columns_to_keep)].copy()
In [38]:
# clean the descriptions a little bit; the chained .str.replace() calls are split
# over multiple lines for readability
# (Series.replace() matches whole values by default, hence .str.replace() here)
my_columns['x'] = (my_columns.x
                   .str.replace(' - ', '_').str.replace(' ', '_')
                   .str.replace("'s", '').str.replace('-', '_')
                   .str.replace('<', '_').str.replace('>', '')
                   .str.replace(',', '').str.replace('__', '_'))
my_columns['x'] = (my_columns.x
                   .str.replace('_value_', '_level_')
                   .str.replace('content_subscale_of_', '')
                   .str.replace('process_subscale_of_', '')
                   .str.replace('_code_3_character', '')
                   .str.replace('_.3_digit_code.', ''))
In [39]:
# create a dictionary of the current column names and the cleaned descriptions, set the latter as new names
columns_dict = dict(zip(my_columns['code'], my_columns['x']))

df.rename(columns=columns_dict, inplace=True)

# check up
df.head()
Out[39]:
Country OECD_country Student_ID Gender Mother_Highest_Schooling Mother_Current_Job_Status Father_Highest_Schooling Father_Current_Job_Status Country_of_Birth_International_Self International_Language_at_Home ... Plausible_level_1_in_reading Plausible_level_2_in_reading Plausible_level_3_in_reading Plausible_level_4_in_reading Plausible_level_5_in_reading Plausible_level_1_in_science Plausible_level_2_in_science Plausible_level_3_in_science Plausible_level_4_in_science Plausible_level_5_in_science
0 United Arab Emirates Non-OECD 00001 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Working full-time <for pay> Other country Language of the test ... 1b 1b 1b 1b Too low 1 1 1 1 1
1 United Arab Emirates Non-OECD 00002 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Not working, but looking for a job Country of test Language of the test ... 1b Too low Too low Too low Too low Too low Too low Too low Too low Too low
2 United Arab Emirates Non-OECD 00003 Male NaN Other (e.g. home duties, retired) <ISCED level 2> Working part-time <for pay> Country of test Language of the test ... 1b Too low Too low Too low 1b Too low Too low Too low Too low Too low
3 United Arab Emirates Non-OECD 00004 Male <ISCED level 3A> Other (e.g. home duties, retired) NaN Working full-time <for pay> Other country Language of the test ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
4 United Arab Emirates Non-OECD 00005 Male <ISCED level 3A> Other (e.g. home duties, retired) <ISCED level 3A> Other (e.g. home duties, retired) Country of test Language of the test ... 1b 1b 1b 1b Too low 1 1 1 1 2

5 rows × 84 columns

Now it is easier to decide which columns have the wrong datatype

In [40]:
# parents schooling

df.Mother_Highest_Schooling.unique()
Out[40]:
array(['<ISCED level 3A> ', nan, '<ISCED level 2> ',
       '<ISCED level 3B, 3C> ', 'She did not complete <ISCED level 1> ',
       '<ISCED level 1> '], dtype=object)
In [41]:
df.Father_Highest_Schooling.unique()
Out[41]:
array(['<ISCED level 3A> ', '<ISCED level 2> ', nan,
       '<ISCED level 3B, 3C> ', '<ISCED level 1> ',
       'He did not complete <ISCED level 1> '], dtype=object)
In [42]:
df.Highest_educational_level_of_parents.unique()
Out[42]:
array(['ISCED 5A, 6', 'ISCED 5B', 'ISCED 3A, ISCED 4', nan, 'ISCED 2',
       'None', 'ISCED 3B, C', 'ISCED 1'], dtype=object)

From the Technical Report, p. 307

Indices on parental education were constructed by recoding educational qualifications into the following categories: (0) None, (1) ISCED 1 (primary education), (2) ISCED 2 (lower secondary), (3) ISCED Level 3B or 3C (vocational/pre-vocational upper secondary), (4) ISCED 3A (general upper secondary) and/or ISCED 4 (non-tertiary post-secondary), (5) ISCED 5B (vocational tertiary) and (6) ISCED 5A, 6 (theoretically oriented tertiary and post-graduate).

These three columns need to be converted to category

First, in the Mother and Father columns, I will

  • eliminate the "<", ">" and "level",
  • turn "he/she did not complete <ISCED level 1>" into "None",
  • replace 'ISCED 3A' with 'ISCED 3A, 4', and
  • strip the trailing whitespace

and, in the parents' schooling column,

  • replace 'ISCED 3B, C' with 'ISCED 3B, 3C'
  • replace 'ISCED 3A, ISCED 4' with 'ISCED 3A, 4'
  • strip the trailing whitespace
In [43]:
df.Mother_Highest_Schooling = (df.Mother_Highest_Schooling
    .str.replace("She did not complete <ISCED level 1>", "None")
    .str.replace('<', '').str.replace('>', '')
    .str.replace(' level ', ' ')
    .str.replace('ISCED 3A', 'ISCED 3A, 4')
    .str.strip())
df.Father_Highest_Schooling = (df.Father_Highest_Schooling
    .str.replace("He did not complete <ISCED level 1>", "None")
    .str.replace('<', '').str.replace('>', '')
    .str.replace(' level ', ' ')
    .str.replace('ISCED 3A', 'ISCED 3A, 4')
    .str.strip())
df.Highest_educational_level_of_parents = (df.Highest_educational_level_of_parents
    .str.replace('ISCED 3B, C', 'ISCED 3B, 3C')
    .str.replace('ISCED 3A, ISCED 4', 'ISCED 3A, 4')
    .str.strip())
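These chains could also be factored into a small helper that applies an ordered list of literal (old, new) replacements; a sketch (the helper name `apply_replacements` is my own):

```python
import pandas as pd

def apply_replacements(series, pairs):
    """Apply an ordered list of (old, new) literal replacements, then strip."""
    for old, new in pairs:
        series = series.str.replace(old, new, regex=False)
    return series.str.strip()

schooling_fixes = [
    ("She did not complete <ISCED level 1>", "None"),
    ('<', ''), ('>', ''),
    (' level ', ' '),
    ('ISCED 3A', 'ISCED 3A, 4'),
]

s = pd.Series(['<ISCED level 3A> ', 'She did not complete <ISCED level 1> '])
cleaned = apply_replacements(s, schooling_fixes)
print(cleaned.tolist())
```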
In [44]:
# prepare the category
education_values = ['None', 'ISCED 1', 'ISCED 2', 'ISCED 3B, 3C', 'ISCED 3A, 4', 'ISCED 5B', 'ISCED 5A, 6']
ordered_ed_values = pd.api.types.CategoricalDtype(ordered=True, categories=education_values)
In [45]:
# apply them
df.Mother_Highest_Schooling = df.Mother_Highest_Schooling.astype(ordered_ed_values)
df.Father_Highest_Schooling = df.Father_Highest_Schooling.astype(ordered_ed_values)
df.Highest_educational_level_of_parents = df.Highest_educational_level_of_parents.astype(ordered_ed_values)

Let's keep on looking at variables

In [46]:
df.iloc[:, 7:12].head()
Out[46]:
Father_Current_Job_Status Country_of_Birth_International_Self International_Language_at_Home First_language_learned Age_started_learning_test_language
0 Working full-time <for pay> Other country Language of the test NaN NaN
1 Not working, but looking for a job Country of test Language of the test NaN NaN
2 Working part-time <for pay> Country of test Language of the test NaN NaN
3 Working full-time <for pay> Other country Language of the test NaN NaN
4 Other (e.g. home duties, retired) Country of test Language of the test NaN NaN
In [47]:
df.Father_Current_Job_Status.unique()
Out[47]:
array(['Working full-time <for pay> ',
       'Not working, but looking for a job',
       'Working part-time <for pay>',
       'Other (e.g. home duties, retired) ', nan], dtype=object)
In [48]:
df.International_Language_at_Home.unique()
Out[48]:
array(['Language of the test', nan, 'Other language'], dtype=object)
In [49]:
df.First_language_learned.unique()
Out[49]:
array([nan,
       '<test language> or <other official national language(s) or d',
       'other than <test language> or <other official national langu'],
      dtype=object)
In [50]:
df.Age_started_learning_test_language.unique()
Out[50]:
array([nan, '0 to 3 years', '4 to 6 years', '10 to 12 years',
       '7 to 9 years', '13 years or older'], dtype=object)
In [51]:
# the last one can become an ordered category (note the order: '7 to 9' before '10 to 12')
learning_test_language = ['0 to 3 years', '4 to 6 years', '7 to 9 years', '10 to 12 years', '13 years or older']
ordered_learning_lang = pd.api.types.CategoricalDtype(ordered=True, categories=learning_test_language)
df.Age_started_learning_test_language = df.Age_started_learning_test_language.astype(ordered_learning_lang)

Of this group, df.iloc[:, 7:12], the Age_started_learning_test_language column is the only one that needed to be converted into a category. The other variables don't have an order and have few values, so they can remain strings.
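The payoff of the ordered category is that comparisons work directly, e.g. selecting students who started learning the test language at age 7 or later; a minimal sketch on toy data:

```python
import pandas as pd

ages = ['0 to 3 years', '4 to 6 years', '7 to 9 years',
        '10 to 12 years', '13 years or older']
dtype = pd.api.types.CategoricalDtype(categories=ages, ordered=True)

s = pd.Series(['0 to 3 years', '10 to 12 years', '13 years or older']).astype(dtype)
late_starters = s[s >= '7 to 9 years']  # ordered comparison against a category label
print(late_starters.tolist())
```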

In [52]:
df.iloc[:, 12:17].head()
Out[52]:
Language_spoken_Mother Language_spoken_Father Standard_or_simplified_set_of_booklets Attitude_towards_School:_Learning_Outcomes Attitude_towards_School:_Learning_Activities
0 NaN NaN Easier set of booklets -1.96 -1.7015
1 NaN NaN Easier set of booklets -0.95 -0.9394
2 NaN NaN Easier set of booklets 0.77 -0.9394
3 NaN NaN Easier set of booklets NaN NaN
4 NaN NaN Easier set of booklets -0.95 -0.9394
In [53]:
df.Language_spoken_Mother.unique()
Out[53]:
array([nan, 'Mostly <heritage language> ', 'Mostly <test language>',
       'About equally <heritage language> and <test language>',
       'Not applicable'], dtype=object)
In [54]:
df.Language_spoken_Mother.isna().sum()
Out[54]:
284532
In [55]:
df.Language_spoken_Father.unique()
Out[55]:
array([nan, 'Mostly <heritage language> ',
       'About equally <heritage language> and <test language>',
       'Not applicable', 'Mostly <test language>'], dtype=object)
In [56]:
df.Standard_or_simplified_set_of_booklets.unique()
Out[56]:
array(['Easier set of booklets', 'Standard set of booklets'], dtype=object)
In [57]:
df['Attitude_towards_School:_Learning_Outcomes'].describe()
Out[57]:
count     202601
unique        95
top        -0.24
freq       49588
Name: Attitude_towards_School:_Learning_Outcomes, dtype: object
In [58]:
df['Attitude_towards_School:_Learning_Activities'].describe()
Out[58]:
count     202093
unique        77
top       1.2115
freq       62578
Name: Attitude_towards_School:_Learning_Activities, dtype: object
In [59]:
# These last two columns need to be converted to float
df['Attitude_towards_School:_Learning_Outcomes'] = df['Attitude_towards_School:_Learning_Outcomes'].astype(float)
df['Attitude_towards_School:_Learning_Activities'] = df['Attitude_towards_School:_Learning_Activities'].astype(float)
In [60]:
df[['Attitude_towards_School:_Learning_Outcomes', 'Attitude_towards_School:_Learning_Activities']].describe()
Out[60]:
Attitude_towards_School:_Learning_Outcomes Attitude_towards_School:_Learning_Activities
count 202601.000000 202093.000000
mean 0.067141 0.044328
std 1.012920 0.998351
min -2.990000 -3.375800
25% -0.640000 -0.939400
50% -0.240000 0.087300
75% 0.770000 1.211500
max 2.350000 1.211500

In the group df.iloc[:, 12:17] only the last two columns needed conversion: from string to float. The language spoken at home could become an ordered category, but I will decide about it later.
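If I later order the language-spoken columns, one plausible ordering (my assumption: mostly heritage < about equally < mostly test language, with 'Not applicable' deliberately left out so it maps to NaN) could be sketched as:

```python
import pandas as pd

# assumed ordering; 'Not applicable' is omitted on purpose and becomes NaN
lang_order = ['Mostly <heritage language>',
              'About equally <heritage language> and <test language>',
              'Mostly <test language>']
lang_dtype = pd.api.types.CategoricalDtype(categories=lang_order, ordered=True)

s = pd.Series(['Mostly <test language>', 'Not applicable']).astype(lang_dtype)
print(s.tolist())
```

Note that the real column values carry trailing whitespace (see the `.unique()` output above), which would need `.str.strip()` first.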

In [61]:
# let's go on
df.iloc[:, 17:22]
Out[61]:
Sense_of_Belonging_to_School Mathematics_Teacher_Classroom_Management Cognitive_Activation_in_Mathematics_Lessons Index_of_economic_social_and_cultural_status Home_educational_resources
0 -1.3 0.3255 -0.4593 0.65 0.04
1 NaN -0.4499 3.2019 0.72 1.12
2 0.08 -1.6718 0.1015 -0.09 0.04
3 NaN NaN NaN 1.08 NaN
4 -0.9 -0.4499 3.2019 -0.24 1.12
... ... ... ... ... ...
314826 NaN NaN NaN -2.08 -0.69
314827 -1.13 -0.0784 -0.6041 -1.22 0.04
314828 -0.15 -0.0784 -0.7532 -1.92 -1.8
314829 -0.15 0.3255 0.2435 -0.71 0.04
314830 NaN NaN NaN -1.16 0.57

314831 rows × 5 columns

These can all be converted to float.

In [62]:
float_columns = ['Sense_of_Belonging_to_School',
                 'Mathematics_Teacher_Classroom_Management',
                 'Cognitive_Activation_in_Mathematics_Lessons',
                 'Index_of_economic_social_and_cultural_status',
                 'Home_educational_resources']

for col in float_columns:
    df[col] = df[col].astype(float)
In [63]:
df.iloc[:, 17:22].describe()
Out[63]:
Sense_of_Belonging_to_School Mathematics_Teacher_Classroom_Management Cognitive_Activation_in_Mathematics_Lessons Index_of_economic_social_and_cultural_status Home_educational_resources
count 203134.000000 202822.000000 204216.000000 309909.000000 309882.000000
mean -0.013157 0.071603 0.098791 -0.210149 -0.225419
std 0.994276 0.999293 1.031045 1.129318 1.063583
min -3.690000 -3.253000 -3.884000 -5.440000 -3.930000
25% -0.740000 -0.449900 -0.459300 -0.950000 -0.690000
50% -0.150000 -0.078400 0.101500 -0.120000 0.040000
75% 0.560000 0.764000 0.540300 0.660000 1.120000
max 2.630000 2.198900 3.201900 3.210000 1.120000
In [64]:
df.iloc[:, 22:27]
Out[64]:
Highest_educational_level_of_parents Instrumental_Motivation_for_Mathematics Mathematics_Interest Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends Language_at_home
0 ISCED 5A, 6 -1.37 1.23 NaN Arabic
1 ISCED 5A, 6 NaN NaN NaN Arabic
2 ISCED 5B NaN NaN NaN Arabic
3 ISCED 3A, 4 NaN NaN NaN Arabic
4 ISCED 3A, 4 1.59 2.29 NaN Arabic
... ... ... ... ... ...
314826 ISCED 2 0.5 0 NaN Vietnamese
314827 ISCED 2 0.05 0.91 NaN Vietnamese
314828 ISCED 2 -0.39 0.58 NaN Vietnamese
314829 ISCED 5A, 6 NaN NaN NaN Vietnamese
314830 ISCED 3A, 4 0.8 -0.66 NaN Vietnamese

314831 rows × 5 columns

In [65]:
# convert column 23 and 24 to float
df['Instrumental_Motivation_for_Mathematics'] = df['Instrumental_Motivation_for_Mathematics'].astype(float)
df['Mathematics_Interest'] = df['Mathematics_Interest'].astype(float)
In [66]:
df.iloc[:, 23:25].describe()
Out[66]:
Instrumental_Motivation_for_Mathematics Mathematics_Interest
count 205531.000000 205712.000000
mean 0.104347 0.188094
std 0.994595 1.008450
min -2.300000 -1.780000
25% -0.670000 -0.340000
50% 0.050000 0.300000
75% 0.800000 0.910000
max 1.590000 2.290000
In [67]:
df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends.unique()
Out[67]:
array([nan, '5', '1', '0', '4', '2', '3'], dtype=object)
In [68]:
# let's handle this column together with the related one
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.unique()
Out[68]:
array([nan, '4', '0', '3', '2', '1'], dtype=object)
In [69]:
# convert the Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends to category
language_preference = ['0', '1', '2', '3', '4', '5']
ordered_lang_preference = pd.api.types.CategoricalDtype(ordered=True, categories=language_preference)
df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends = df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends.astype(ordered_lang_preference)
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production = df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.astype(ordered_lang_preference)
In [70]:
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.unique()
Out[70]:
[NaN, 4, 0, 3, 2, 1]
Categories (5, object): [0 < 1 < 2 < 3 < 4]
In [71]:
df['Language_at_home'].unique()
Out[71]:
array(['Arabic ', 'Invalid', 'Missing', 'English',
       'Another language (QRE) ', nan, 'Spanish', 'Portuguese ',
       'Another language (ARG) ', 'Indigenouslanguage (ARG) ',
       'Another language (AUS) ', 'Italian', 'German ', 'Cantonese',
       'Hindi', 'Mandarin ', 'Greek', 'Vietnamese ',
       'Australian languages ', 'Turkish', 'Serbian', 'Polish ',
       'Another language (AUT) ', 'Bosnian', 'Romanian ', 'Croatian ',
       'Macedonian ', 'Albanian ', 'Russian', 'Kurdish', 'French ',
       'Another language (BEL) ', 'Eastern European languages ',
       'Western European languages ', 'Flemish dialect (BEL)', 'Dutch',
       'Walloon', 'German dialect (BEL) ', 'Another language (CAN) ',
       'Swiss German ', 'Another language (CHE) ',
       'Yugoslavian - Serbian, Croatian, etc ', 'Swiss Italian',
       "Raeto'-Romance ", 'German (LIE) ', 'Another language (CHL) ',
       'Another language (COL) ', 'Another language (CRI) ',
       'Another language (DEU) ', 'Basque ', 'Another language (ESP) ',
       'Catalan', 'Galician ', 'Valencian', 'Finnish', 'Somali ',
       'Another language (FIN) ', 'Estonian ', 'Chinese', 'Thai ',
       'Swedish', 'Romani ', 'Another language (FRA) ',
       'Regional languages (FRA) ', 'Another language (QUK) ', 'Welsh',
       'Another language (QSC) ', 'Ulster Scots ', 'Irish',
       'Other European languages (QSC) ', 'Scottish Gaelic',
       'Another language (HKG) ', 'Chinese dialects or languages (HKG)',
       'Another language (IRL) ', 'Another EU language (ITA)',
       'A dialect (ITA)', 'Another language (ITA) ',
       'Another official language (ITA)', 'Slovenian', 'Japanese ',
       'Another language (JPN) ', 'Korean ', 'Luxembourgish',
       'Another language (LUX) ', 'Another language (MAC) ',
       'Chinese dialects or languages (MAC)', 'Indian Languages (MEX) ',
       'Another language (MEX) ', 'Another language (NZL) ', 'Maori',
       'Samoan ', 'Quechua', 'Amazonic language (PER)',
       'Another language (PER) ', 'Aymara ', 'Another language (QAT) ',
       'Shanghai dialect ', 'Another local Chinese dialect (QCN)',
       'Another language (QCN) ', 'Another language (USA) ', 'Tamil',
       'Malay', 'Another language (SGP) ', 'Taiwanese dialect',
       'Hakka dialect (TAP)', 'Another language (TAP) ',
       'Aboriginal dialect (TAP) ', 'Arabic dialect (TUN) ',
       'Another language (TUN) ', 'Portugués del Uruguay',
       'Another language (URY) ', 'Another Language (VNM) '], dtype=object)
In [72]:
# clean the trailing whitespaces
df.Language_at_home = df.Language_at_home.str.strip()
  • Instrumental_Motivation_for_Mathematics and Mathematics_Interest have been converted to float;
  • Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends and Preference_for_Heritage_Language_in_Language_Reception_and_Production to ordered category;
  • Language_at_home can stay as it is for now, but I will probably need to merge some values for the languages I'm interested in if I want to use it: for instance, Italian appears as 'Italian', 'Another EU language (ITA)', 'A dialect (ITA)', 'Another language (ITA)', 'Another official language (ITA)' and 'Swiss Italian'
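If I do end up merging the Italian variants, the renaming could look like this sketch (the labels are taken from the unique values listed above, after stripping; `replace` maps them all onto a single 'Italian' label):

```python
import pandas as pd

# labels that all correspond to Italian, per the unique values above
italian_variants = ['Another EU language (ITA)', 'A dialect (ITA)',
                    'Another language (ITA)', 'Another official language (ITA)',
                    'Swiss Italian']

# toy series standing in for df.Language_at_home (already stripped)
lang = pd.Series(['Italian', 'A dialect (ITA)', 'German', 'Swiss Italian'])
lang = lang.replace(italian_variants, 'Italian')

print(lang.tolist())  # → ['Italian', 'Italian', 'German', 'Italian']
```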
In [73]:
df.iloc[:, 27:32]
Out[73]:
Preference_for_Heritage_Language_in_Language_Reception_and_Production Mathematics_Work_Ethic Mathematics_Teacher_Support Mathematics_Self_Concept Teacher_Student_Relations
0 NaN -0.0651 -0.9508 0.18 -0.48
1 NaN NaN -0.2395 1.12 NaN
2 NaN NaN 1.8433 0.65 -0.02
3 NaN NaN NaN NaN NaN
4 NaN 2.7167 -0.2395 0.41 -0.02
... ... ... ... ... ...
314826 NaN -0.0651 NaN NaN NaN
314827 NaN 0.2882 1.1169 0.65 -1.26
314828 NaN -0.2356 -0.2395 0.18 -0.48
314829 NaN NaN 0.2486 0.18 -0.02
314830 NaN -0.4017 NaN NaN NaN

314831 rows × 5 columns

In [74]:
# the first column has already been converted;
# the other four columns to float
df.iloc[:,28:32] = df.iloc[:, 28:32].astype(float)
In [75]:
df.iloc[:,28:32].describe()
Out[75]:
Mathematics_Work_Ethic Mathematics_Teacher_Support Mathematics_Self_Concept Teacher_Student_Relations
count 204331.000000 203436.000000 204183.000000 203429.000000
mean 0.145413 0.157205 0.037056 0.111293
std 1.011064 1.012511 0.975370 1.030987
min -3.450300 -2.864500 -2.180000 -3.110000
25% -0.401700 -0.657700 -0.520000 -0.480000
50% 0.109900 0.248600 -0.060000 -0.020000
75% 0.649000 1.116900 0.650000 0.810000
max 2.716700 1.843300 2.260000 2.160000
In [76]:
df.iloc[:,32:37]
Out[76]:
Subjective_Norms_in_Mathematics Language_of_the_test Plausible_level_1_in_mathematics Plausible_level_2_in_mathematics Plausible_level_3_in_mathematics
0 0.3026 Arabic Below 1 1 Below 1
1 NaN Arabic Below 1 Below 1 Below 1
2 NaN Arabic Below 1 Below 1 Below 1
3 NaN Arabic Below 1 Below 1 Below 1
4 0.3026 Arabic Below 1 Below 1 Below 1
... ... ... ... ... ...
314826 0.3026 Vietnamese 2 3 2
314827 -0.3852 Vietnamese 3 3 3
314828 -0.3852 Vietnamese 2 2 3
314829 NaN Vietnamese 4 3 3
314830 -0.0455 Vietnamese 2 2 2

314831 rows × 5 columns

In [77]:
# first column to float
df.Subjective_Norms_in_Mathematics = df.Subjective_Norms_in_Mathematics.astype(float)
df.Subjective_Norms_in_Mathematics.describe()
Out[77]:
count    205532.000000
mean          0.150539
std           1.061653
min          -4.245600
25%          -0.385200
50%          -0.045500
75%           0.660200
max           3.858700
Name: Subjective_Norms_in_Mathematics, dtype: float64
In [78]:
# Language_of_test: check the values 
df.Language_of_the_test.unique()
Out[78]:
array(['Arabic ', 'English', nan, 'Spanish', 'German ', 'French ',
       'Dutch', 'Italian', 'Catalan', 'Basque ', 'Valencian', 'Finnish',
       'Swedish', 'Welsh', 'Cantonese', 'Irish', 'Slovenian', 'Japanese ',
       'Korean ', 'Chinese', 'Portuguese ',
       'Hybrid - English + Arabic (QAT)', 'Shanghai dialect ',
       'Mandarin ', 'Vietnamese '], dtype=object)

Language of the test is fine as a string, but I can drop the rows where the test is not in one of my chosen languages. The groups are:

  • shallow_ortography: Spanish, Finnish, Italian, German

  • deep_ortography: English, French, Arabic

  • logographic: Chinese, Japanese, Korean. To these I will add Shanghai dialect, Mandarin and Cantonese, because they are all written with Chinese characters.

Since English and Arabic belong to the same group, I will keep the "Hybrid - English + Arabic (QAT)" rows, renaming the label to 'English_Arabic'.

In [79]:
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']

test_lang_to_keep = shallow_ortography + deep_ortography + logographic
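Later on it may be handy to collapse these three lists into a single language → group mapping, e.g. to derive a group column; a sketch, assuming a plain string label per group is enough:

```python
import pandas as pd

shallow_ortography = ['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']

# build one dict: language -> orthography group name
lang_to_group = {lang: group
                 for group, langs in [('shallow', shallow_ortography),
                                      ('deep', deep_ortography),
                                      ('logographic', logographic)]
                 for lang in langs}

# toy series standing in for df.Language_of_the_test after cleaning
tests = pd.Series(['Italian', 'Arabic', 'Mandarin'])
print(tests.map(lang_to_group).tolist())  # → ['shallow', 'deep', 'logographic']
```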
In [80]:
# clean the test language labels
df.Language_of_the_test = df.Language_of_the_test.str.strip().str.replace('Hybrid.*', 'English_Arabic', regex=True)
In [81]:
# check values
df.Language_of_the_test.unique()
Out[81]:
array(['Arabic', 'English', nan, 'Spanish', 'German', 'French', 'Dutch',
       'Italian', 'Catalan', 'Basque', 'Valencian', 'Finnish', 'Swedish',
       'Welsh', 'Cantonese', 'Irish', 'Slovenian', 'Japanese', 'Korean',
       'Chinese', 'Portuguese', 'English_Arabic', 'Shanghai dialect',
       'Mandarin', 'Vietnamese'], dtype=object)
In [82]:
# drop all rows where the test is not in test_lang_to_keep
df = df.query('Language_of_the_test in @test_lang_to_keep')
In [83]:
# just to be sure that only plausible levels are left
df.iloc[:,34:].columns
Out[83]:
Index(['Plausible_level_1_in_mathematics', 'Plausible_level_2_in_mathematics',
       'Plausible_level_3_in_mathematics', 'Plausible_level_4_in_mathematics',
       'Plausible_level_5_in_mathematics',
       'Plausible_level_1_in_math_Change_and_Relationships',
       'Plausible_level_2_in_math_Change_and_Relationships',
       'Plausible_level_3_in_math_Change_and_Relationships',
       'Plausible_level_4_in_math_Change_and_Relationships',
       'Plausible_level_5_in_math_Change_and_Relationships',
       'Plausible_level_1_in_math_Quantity',
       'Plausible_level_2_in_math_Quantity',
       'Plausible_level_3_in_math_Quantity',
       'Plausible_level_4_in_math_Quantity',
       'Plausible_level_5_in_math_Quantity',
       'Plausible_level_1_in_math_Space_and_Shape',
       'Plausible_level_2_in_math_Space_and_Shape',
       'Plausible_level_3_in_math_Space_and_Shape',
       'Plausible_level_4_in_math_Space_and_Shape',
       'Plausible_level_5_in_math_Space_and_Shape',
       'Plausible_level_1_in_math_Uncertainty_and_Data',
       'Plausible_level_2_in_math_Uncertainty_and_Data',
       'Plausible_level_3_in_math_Uncertainty_and_Data',
       'Plausible_level_4_in_math_Uncertainty_and_Data',
       'Plausible_level_5_in_math_Uncertainty_and_Data',
       'Plausible_level_1_in_math_Employ', 'Plausible_level_2_in_math_Employ',
       'Plausible_level_3_in_math_Employ', 'Plausible_level_4_in_math_Employ',
       'Plausible_level_5_in_math_Employ',
       'Plausible_level_1_in_math_Formulate',
       'Plausible_level_2_in_math_Formulate',
       'Plausible_level_3_in_math_Formulate',
       'Plausible_level_4_in_math_Formulate',
       'Plausible_level_5_in_math_Formulate',
       'Plausible_level_1_in_math_Interpret',
       'Plausible_level_2_in_math_Interpret',
       'Plausible_level_3_in_math_Interpret',
       'Plausible_level_4_in_math_Interpret',
       'Plausible_level_5_in_math_Interpret', 'Plausible_level_1_in_reading',
       'Plausible_level_2_in_reading', 'Plausible_level_3_in_reading',
       'Plausible_level_4_in_reading', 'Plausible_level_5_in_reading',
       'Plausible_level_1_in_science', 'Plausible_level_2_in_science',
       'Plausible_level_3_in_science', 'Plausible_level_4_in_science',
       'Plausible_level_5_in_science'],
      dtype='object')
In [84]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 298594 entries, 0 to 309871
Data columns (total 84 columns):
Country                                                                      298594 non-null object
OECD_country                                                                 298594 non-null object
Student_ID                                                                   298594 non-null object
Gender                                                                       298594 non-null object
Mother_Highest_Schooling                                                     280683 non-null category
Mother_Current_Job_Status                                                    288071 non-null object
Father_Highest_Schooling                                                     271374 non-null category
Father_Current_Job_Status                                                    278630 non-null object
Country_of_Birth_International_Self                                          293039 non-null object
International_Language_at_Home                                               285151 non-null object
First_language_learned                                                       89133 non-null object
Age_started_learning_test_language                                           29903 non-null category
Language_spoken_Mother                                                       29321 non-null object
Language_spoken_Father                                                       28843 non-null object
Standard_or_simplified_set_of_booklets                                       298594 non-null object
Attitude_towards_School:_Learning_Outcomes                                   192377 non-null float64
Attitude_towards_School:_Learning_Activities                                 191875 non-null float64
Sense_of_Belonging_to_School                                                 192899 non-null float64
Mathematics_Teacher_Classroom_Management                                     192608 non-null float64
Cognitive_Activation_in_Mathematics_Lessons                                  193974 non-null float64
Index_of_economic_social_and_cultural_status                                 294289 non-null float64
Home_educational_resources                                                   294250 non-null float64
Highest_educational_level_of_parents                                         290593 non-null category
Instrumental_Motivation_for_Mathematics                                      195270 non-null float64
Mathematics_Interest                                                         195450 non-null float64
Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends    29544 non-null category
Language_at_home                                                             296565 non-null object
Preference_for_Heritage_Language_in_Language_Reception_and_Production        28814 non-null category
Mathematics_Work_Ethic                                                       194110 non-null float64
Mathematics_Teacher_Support                                                  193216 non-null float64
Mathematics_Self_Concept                                                     193941 non-null float64
Teacher_Student_Relations                                                    193185 non-null float64
Subjective_Norms_in_Mathematics                                              195273 non-null float64
Language_of_the_test                                                         298594 non-null object
Plausible_level_1_in_mathematics                                             298594 non-null category
Plausible_level_2_in_mathematics                                             298594 non-null category
Plausible_level_3_in_mathematics                                             298594 non-null category
Plausible_level_4_in_mathematics                                             298594 non-null category
Plausible_level_5_in_mathematics                                             298594 non-null category
Plausible_level_1_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_2_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_3_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_4_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_5_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_1_in_math_Quantity                                           293641 non-null category
Plausible_level_2_in_math_Quantity                                           293641 non-null category
Plausible_level_3_in_math_Quantity                                           293641 non-null category
Plausible_level_4_in_math_Quantity                                           293641 non-null category
Plausible_level_5_in_math_Quantity                                           293641 non-null category
Plausible_level_1_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_2_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_3_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_4_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_5_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_1_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_2_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_3_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_4_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_5_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_1_in_math_Employ                                             284568 non-null category
Plausible_level_2_in_math_Employ                                             284568 non-null category
Plausible_level_3_in_math_Employ                                             284568 non-null category
Plausible_level_4_in_math_Employ                                             284568 non-null category
Plausible_level_5_in_math_Employ                                             284568 non-null category
Plausible_level_1_in_math_Formulate                                          284568 non-null category
Plausible_level_2_in_math_Formulate                                          284568 non-null category
Plausible_level_3_in_math_Formulate                                          284568 non-null category
Plausible_level_4_in_math_Formulate                                          284568 non-null category
Plausible_level_5_in_math_Formulate                                          284568 non-null category
Plausible_level_1_in_math_Interpret                                          284568 non-null category
Plausible_level_2_in_math_Interpret                                          284568 non-null category
Plausible_level_3_in_math_Interpret                                          284568 non-null category
Plausible_level_4_in_math_Interpret                                          284568 non-null category
Plausible_level_5_in_math_Interpret                                          284568 non-null category
Plausible_level_1_in_reading                                                 298594 non-null category
Plausible_level_2_in_reading                                                 298594 non-null category
Plausible_level_3_in_reading                                                 298594 non-null category
Plausible_level_4_in_reading                                                 298594 non-null category
Plausible_level_5_in_reading                                                 298594 non-null category
Plausible_level_1_in_science                                                 298594 non-null category
Plausible_level_2_in_science                                                 298594 non-null category
Plausible_level_3_in_science                                                 298594 non-null category
Plausible_level_4_in_science                                                 298594 non-null category
Plausible_level_5_in_science                                                 298594 non-null category
dtypes: category(56), float64(14), object(14)
memory usage: 82.0+ MB

Here we go! It looks good enough to me to start some exploration.

I need to know a bit more on the dataset to decide:

  • how to deal with missing data;
  • if I want to modify Language_at_home (for instance, Italian is in 'Italian', 'Another EU language (ITA)', 'A dialect (ITA)', 'Another language (ITA) ', 'Another official language (ITA)' and 'Swiss Italian')

Since I will surely modify the dataset, I'll go with a copy.

In [85]:
# copy the df
clean_df = df.copy()

Univariate Exploration

First, I'll have a look at the different plausible levels

In [86]:
# do the values change a lot among columns of the same scale?
clean_df.iloc[:,34:].head(20)
Out[86]:
Plausible_level_1_in_mathematics Plausible_level_2_in_mathematics Plausible_level_3_in_mathematics Plausible_level_4_in_mathematics Plausible_level_5_in_mathematics Plausible_level_1_in_math_Change_and_Relationships Plausible_level_2_in_math_Change_and_Relationships Plausible_level_3_in_math_Change_and_Relationships Plausible_level_4_in_math_Change_and_Relationships Plausible_level_5_in_math_Change_and_Relationships ... Plausible_level_1_in_reading Plausible_level_2_in_reading Plausible_level_3_in_reading Plausible_level_4_in_reading Plausible_level_5_in_reading Plausible_level_1_in_science Plausible_level_2_in_science Plausible_level_3_in_science Plausible_level_4_in_science Plausible_level_5_in_science
0 Below 1 1 Below 1 Below 1 Below 1 1 Below 1 1 1 Below 1 ... 1b 1b 1b 1b Too low 1 1 1 1 1
1 Below 1 Below 1 Below 1 Below 1 Below 1 1 Below 1 Below 1 1 Below 1 ... 1b Too low Too low Too low Too low Too low Too low Too low Too low Too low
2 Below 1 Below 1 Below 1 Below 1 Below 1 1 Below 1 Below 1 Below 1 Below 1 ... 1b Too low Too low Too low 1b Too low Too low Too low Too low Too low
3 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
4 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... 1b 1b 1b 1b Too low 1 1 1 1 2
5 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low 1
6 1 2 2 1 2 2 2 2 3 2 ... 2 2 2 1a 2 1 1 2 2 2
7 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
8 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... 1b 1b 1b 1b 1b Too low 1 1 1 1
9 Below 1 Below 1 Below 1 Below 1 Below 1 1 Below 1 1 Below 1 Below 1 ... Too low Too low Too low 1b Too low Too low Too low Too low 1 Too low
10 1 Below 1 Below 1 1 Below 1 1 1 1 1 1 ... 1a 1b 1b 1b 1b 1 Too low 1 1 Too low
11 1 2 1 2 1 1 2 3 2 2 ... 1a 1a 1b 1a 1b 2 2 2 2 2
12 1 1 Below 1 1 2 1 Below 1 1 1 1 ... 2 1a 1a 2 1a 2 2 1 2 2
13 2 3 3 2 2 2 1 2 1 1 ... 2 2 2 1a 1a 3 3 3 2 2
14 Below 1 Below 1 1 1 Below 1 Below 1 Below 1 Below 1 Below 1 1 ... 1b 1b 1a 1a 1a 2 2 2 2 1
15 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
16 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
17 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... Too low Too low Too low Too low Too low Too low Too low Too low Too low Too low
18 1 1 1 1 1 Below 1 Below 1 Below 1 Below 1 Below 1 ... 1b 1b Too low 1b 1a 1 1 1 1 1
19 1 1 1 Below 1 1 1 1 1 1 Below 1 ... 1b 1b Too low 1b 1b 1 Too low Too low Too low 1

20 rows × 50 columns

In [87]:
# math main scale, columns 34-38
plt.figure(figsize=(20,6))

for i, col in enumerate(range(34, 39)):
    plt.subplot(1, 5, i + 1)
    sb.countplot(clean_df.iloc[:, col], color=sb.color_palette()[0]);
In [88]:
# math Change_and_Relationships subscale, columns 39-43
plt.figure(figsize=(20,6))

for i, col in enumerate(range(39, 44)):
    plt.subplot(1, 5, i + 1)
    sb.countplot(clean_df.iloc[:, col], color=sb.color_palette()[1]);
In [89]:
# reading scale, columns 74-78
plt.figure(figsize=(20,6))

for i, col in enumerate(range(74, 79)):
    plt.subplot(1, 5, i + 1)
    sb.countplot(clean_df.iloc[:, col], color=sb.color_palette()[2]);
In [90]:
# science scale, columns 79-83
plt.figure(figsize=(20,6))

for i, col in enumerate(range(79, 84)):
    plt.subplot(1, 5, i + 1)
    sb.countplot(clean_df.iloc[:, col], color=sb.color_palette()[3]);

They all look very alike within each scale and subscale, so I'll keep just the first column of every scale/subscale.

In [91]:
# drop all the plausible levels 2 to 5
clean_df = clean_df.filter(regex=r'(^(?!Plausible_level_[2-5]_).*)')
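The negative lookahead in that pattern is easy to misread, so here is a quick check of what it keeps, on a toy frame with the same naming scheme:

```python
import pandas as pd

# toy frame: only the column names matter here
toy = pd.DataFrame(columns=['Country',
                            'Plausible_level_1_in_mathematics',
                            'Plausible_level_2_in_mathematics',
                            'Plausible_level_5_in_science'])

# keep every column whose name does NOT start with Plausible_level_2..5
kept = toy.filter(regex=r'(^(?!Plausible_level_[2-5]_).*)')
print(list(kept.columns))  # → ['Country', 'Plausible_level_1_in_mathematics']
```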
In [92]:
clean_df.shape
Out[92]:
(298594, 44)
In [93]:
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 298594 entries, 0 to 309871
Data columns (total 44 columns):
Country                                                                      298594 non-null object
OECD_country                                                                 298594 non-null object
Student_ID                                                                   298594 non-null object
Gender                                                                       298594 non-null object
Mother_Highest_Schooling                                                     280683 non-null category
Mother_Current_Job_Status                                                    288071 non-null object
Father_Highest_Schooling                                                     271374 non-null category
Father_Current_Job_Status                                                    278630 non-null object
Country_of_Birth_International_Self                                          293039 non-null object
International_Language_at_Home                                               285151 non-null object
First_language_learned                                                       89133 non-null object
Age_started_learning_test_language                                           29903 non-null category
Language_spoken_Mother                                                       29321 non-null object
Language_spoken_Father                                                       28843 non-null object
Standard_or_simplified_set_of_booklets                                       298594 non-null object
Attitude_towards_School:_Learning_Outcomes                                   192377 non-null float64
Attitude_towards_School:_Learning_Activities                                 191875 non-null float64
Sense_of_Belonging_to_School                                                 192899 non-null float64
Mathematics_Teacher_Classroom_Management                                     192608 non-null float64
Cognitive_Activation_in_Mathematics_Lessons                                  193974 non-null float64
Index_of_economic_social_and_cultural_status                                 294289 non-null float64
Home_educational_resources                                                   294250 non-null float64
Highest_educational_level_of_parents                                         290593 non-null category
Instrumental_Motivation_for_Mathematics                                      195270 non-null float64
Mathematics_Interest                                                         195450 non-null float64
Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends    29544 non-null category
Language_at_home                                                             296565 non-null object
Preference_for_Heritage_Language_in_Language_Reception_and_Production        28814 non-null category
Mathematics_Work_Ethic                                                       194110 non-null float64
Mathematics_Teacher_Support                                                  193216 non-null float64
Mathematics_Self_Concept                                                     193941 non-null float64
Teacher_Student_Relations                                                    193185 non-null float64
Subjective_Norms_in_Mathematics                                              195273 non-null float64
Language_of_the_test                                                         298594 non-null object
Plausible_level_1_in_mathematics                                             298594 non-null category
Plausible_level_1_in_math_Change_and_Relationships                           293641 non-null category
Plausible_level_1_in_math_Quantity                                           293641 non-null category
Plausible_level_1_in_math_Space_and_Shape                                    293641 non-null category
Plausible_level_1_in_math_Uncertainty_and_Data                               293641 non-null category
Plausible_level_1_in_math_Employ                                             284568 non-null category
Plausible_level_1_in_math_Formulate                                          284568 non-null category
Plausible_level_1_in_math_Interpret                                          284568 non-null category
Plausible_level_1_in_reading                                                 298594 non-null category
Plausible_level_1_in_science                                                 298594 non-null category
dtypes: category(16), float64(14), object(14)
memory usage: 80.6+ MB

Having a look above, the columns for

  • Language_of_the_test
  • Plausible_level_1_in_mathematics
  • Plausible_level_1_in_science
  • Standard_or_simplified_set_of_booklets
  • Country
  • OECD_country
  • Student_ID
  • Gender

are all complete (298594 non-null).


All other columns are incomplete

Plausible level in reading only contains 233762 non null, and it is a variable I'm definitely planning to use. Other possible limitations are given by

  • Index_of_economic_social_and_cultural_status: 294289 non-null
  • Language_athome(3_digit_code): 296565 non-null

Moreover, the math subscales are not complete either (could this be because lower-scoring students didn't complete all test items? Or because of the booklet rotation system? I could not find the answer in the Technical Report).

Since we have already seen them, let's start here.

The plausible values

In [94]:
# math main scale and subscales

# create the grid
fig, axes = plt.subplots(2,4, figsize=(20,10))
axes = axes.flatten()

for i in range(8):
    plt.sca(axes[i])
    col = 34 + i # column to plot
    sb.countplot(clean_df.iloc[:,col], color=sb.color_palette()[0])
    plt.ylim(0, 70000) # y-limit matched to the main scale (it has the largest counts)

The distribution of students across the math scale and subscales is unimodal and slightly right-skewed, with the majority of students at levels 2 and 3, followed by 1 and below_1, and lastly 4, 5, 6.

The distribution of the Uncertainty_and_Data subscale is almost identical to that of the main mathematical scale (first plot). The other subscales vary to different degrees.

It makes sense that there are not many students at level "6", since, at 15, math programs are certainly not yet complete. Hopefully the "Too low" ones are due to problems with the language (maybe students who moved from a different language area).

In [95]:
# reading and science NB: plotted together just for convenience
plt.figure(figsize=(10,6))

plt.subplot(1,2,1)
sb.countplot(clean_df.iloc[:,42], color=sb.color_palette()[2]);

plt.subplot(1,2,2)
sb.countplot(clean_df.iloc[:,43], color=sb.color_palette()[3]);

SCIENCE The science distribution looks more "normal", but the "Too low" column count is significant, so I would say it is still slightly right-skewed, though less so than the main math scale. As with mathematics, it makes sense that there are not many students at level "6", since, at 15, science programs are certainly not yet complete. And again, hopefully the "Too low" ones are due to problems with the language.

READING Finally, this distribution looks normal. If people have access to schooling of any type, reading is surely one of the key subjects, and the skill is then applied to the study of almost all other topics.

Let's have a look at other variables, the ones with less non-null values.

  • Categories (some ordered, others not):

    Country, OECD_country, Language_at_home, International_Language_at_Home, Language_of_the_test, Country_of_Birth_International_Self, Gender, Standard_or_simplified_set_of_booklets, Highest_educational_level_of_parents, Mother_Highest_Schooling, Mother_Current_Job_Status, Father_Highest_Schooling, Father_Current_Job_Status

  • Numerical:

    Attitude_towards_School:_Learning_Outcomes, Attitude_towards_School:_Learning_Activities, Sense_of_Belonging_to_School, Mathematics_Teacher_Classroom_Management, Cognitive_Activation_in_Mathematics_Lessons, Index_of_economic_social_and_cultural_status, Home_educational_resources, Instrumental_Motivation_for_Mathematics, Mathematics_Interest, Mathematics_Work_Ethic, Mathematics_Teacher_Support, Mathematics_Self_Concept, Teacher_Student_Relations, Subjective_Norms_in_Mathematics

Other

CATEGORICAL variables

In [96]:
# CATEGORIES (some ordered, others not):
categories_to_plot = ['Country', 'OECD_country', 'Country_of_Birth_International_Self', 'Language_of_the_test', 'Language_at_home', 'International_Language_at_Home', 'Gender', 'Standard_or_simplified_set_of_booklets', 'Highest_educational_level_of_parents', 'Mother_Highest_Schooling', 'Mother_Current_Job_Status', 'Father_Highest_Schooling', 'Father_Current_Job_Status']

# create the grid
fig, axes = plt.subplots(7,2, figsize=(20,40))
axes = axes.flatten()
color_range = [0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9]

for i in range(13):
    plt.sca(axes[i])
    col = categories_to_plot[i] # column to plot
    sb.countplot(data=clean_df, y=col, color=sb.color_palette()[color_range[i]])
    plt.xticks(rotation=20)
    
ATTENTION: Scales are all different!

Educational level of mother, father, parents: puzzling!

The highest schooling level most frequently reported for fathers and mothers (271,374 and 280,683 answers respectively) is "ISCED 3A, 4", while for the highest level between the two parents, "ISCED 5B" and "ISCED 5A, 6" together contain around 150k of the 290,593 available data points.

FROM the Technical report, p. 307:

Students’ responses regarding parental education were classified using ISCED (OECD, 1999). Indices on parental education were constructed by recoding educational qualifications into the following categories: (0) None, (1) ISCED 1 (primary education), (2) ISCED 2 (lower secondary), (3) ISCED Level 3B or 3C (vocational/pre-vocational upper secondary), (4) ISCED 3A (general upper secondary) and/or ISCED 4 (non-tertiary post-secondary), (5) ISCED 5B (vocational tertiary) and (6) ISCED 5A, 6 (theoretically oriented tertiary and post-graduate). Indices with these categories were provided for the students’ mother (MISCED) and the students’ father (FISCED). In addition, the index of highest educational level of parents (HISCED) corresponds to the higher ISCED level of either parent.

The variables described above (MISCED, FISCED and HISCED) are the ones contained in the database. The survey collected the same information with the parent questionnaire (not administered in every country), but that data is reported in another variable (PQHISCED) not included in the PISA2012 dataset, and, from the description above, it looks like it has not been used to adjust the HISCED values.

Parents work

Most fathers work full time, around 30k work part-time and about as many are at home (whether looking for a job or not). Most mothers work as well, full- or part-time, but a larger proportion is at home.

Set of booklets

Standard is the majority, but there are still more than 80k simplified ones.

Gender

The students in the selected dataset are about 50% male and 50% female.

Language

  • International_Language_at_Home: in most cases it is the same as the test language
  • Language_at_home: there are too many values and in many cases I'm not interested in them. Maybe it is possible to create broader groups (not sure it is needed, though).
  • Language_of_the_test: Spanish is the most represented, followed by English. I need to regroup them into my 3 language-type categories, but it is already apparent that the logographic type will account for far fewer rows than each of the others.
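If broader groups for Language_at_home ever become necessary, one simple option is a frequency cutoff: keep the common languages and collapse the rest into an "Other" bucket. A minimal sketch on a toy Series (the names and the threshold are illustrative, not taken from clean_df):

```python
import pandas as pd

# Hypothetical mini-example (not the actual Language_at_home column):
# collapse values below a frequency threshold into a broader "Other" group.
s = pd.Series(['Spanish'] * 5 + ['English'] * 4 + ['Basque', 'Catalan'])
counts = s.value_counts()
common = counts[counts >= 3].index          # keep only frequent languages
grouped = s.where(s.isin(common), 'Other')  # everything else becomes "Other"
print(grouped.value_counts())
```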

Country

  • Country_of_Birth_International_Self: in most cases it is the same as the country where the students were tested.
  • Country: the most represented are Mexico and Italy (more than 30k students each), followed by Spain and Canada (a little above 20k each). Between 15k and 10k there are Australia, the UK, the United Arab Emirates, Switzerland and Qatar (and probably the US, putting together the US, Connecticut, Massachusetts and Florida). All other Countries are under 10k.
Notes
  • Not to forget: there were more students in the selected countries, but I dropped the rows where the test language was not one of the languages I selected.
  • I could have selected from the start the rows whose test language was in my list, but the initial cut of the dataset was done on the Countries (keeping those whose official language(s) were in the list) in order to work with a lighter dataset.

Language_of_the_test

Let's regroup the languages into my main three categories, which are already summarized into these variables:

shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
In [97]:
clean_df.Language_of_the_test.unique()
Out[97]:
array(['Arabic', 'English', 'Spanish', 'German', 'French', 'Italian',
       'Finnish', 'Cantonese', 'Japanese', 'Korean', 'Chinese',
       'English_Arabic', 'Shanghai dialect', 'Mandarin'], dtype=object)
In [98]:
# create a column "Language_type" and populate it on the base of the language_of_the_test variable

clean_df.loc[clean_df.Language_of_the_test.isin(shallow_ortography), 'Language_type'] = 'shallow orthography'
clean_df.loc[clean_df.Language_of_the_test.isin(deep_ortography), 'Language_type'] = 'deep orthography'
clean_df.loc[clean_df.Language_of_the_test.isin(logographic), 'Language_type'] = 'logographic'
In [99]:
# check my new column
clean_df.Language_type.value_counts()
Out[99]:
shallow orthography    152681
deep orthography       114292
logographic             31621
Name: Language_type, dtype: int64
In [100]:
# plot it
labels = []
for a, b in zip(clean_df.Language_type.value_counts().index, clean_df.Language_type.value_counts().values/clean_df.Language_type.count()*100):
    labels.append(a + '\n' + '{:.2f}'.format(b) + '%')
plt.pie(clean_df.Language_type.value_counts(), labels=labels, startangle=-94, counterclock=True);
plt.axis('square')
Out[100]:
(-1.1207301120128714,
 1.1051868714843143,
 -1.1072893701779802,
 1.1186276133192055)

Logographic languages account for a much smaller proportion of the overall data than the other two language groups.


NUMERIC variables

In [101]:
# numeric variables to plot
num_var_to_plot = ['Attitude_towards_School:_Learning_Outcomes', 'Attitude_towards_School:_Learning_Activities', 'Sense_of_Belonging_to_School', 'Mathematics_Teacher_Classroom_Management', 'Cognitive_Activation_in_Mathematics_Lessons', 'Index_of_economic_social_and_cultural_status', 'Home_educational_resources', 'Instrumental_Motivation_for_Mathematics', 'Mathematics_Interest', 'Mathematics_Work_Ethic', 'Mathematics_Teacher_Support', 'Mathematics_Self_Concept', 'Teacher_Student_Relations', 'Subjective_Norms_in_Mathematics']
In [102]:
# create the grid
fig, axes = plt.subplots(7,2, figsize=(20,40))
axes = axes.flatten()
color_range = [0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9]

for i in range(14):
    plt.sca(axes[i])
    col = num_var_to_plot[i] # column to plot
    plt.hist(data=clean_df, x=col, color=sb.color_palette()[color_range[i]])
    plt.title(col)
C:\Users\annap\Anaconda3\lib\site-packages\numpy\lib\histograms.py:839: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\annap\Anaconda3\lib\site-packages\numpy\lib\histograms.py:840: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

(warning above because of NaNs)

Well, it was good to see them all, but the variables that I really want here are:

Index_of_economic_social_cultural_status

From the Technical report, p.353:

[...] the ESCS in PISA 2012 consisted of three sub-components, the highest parental occupation (HISEI), the highest parental education expressed as years of schooling (PARED) and the index of home possessions (HOMEPOS) which comprised all items on the WEALTH, CULTPOS and HEDRES scales, as well as books in the home (ST28Q01) [...]

This index has been weighted in order to be comparable across Countries. Of course it cannot be perfect, because of the great socioeconomic and cultural differences among Countries, but it is my best bet for comparing students' performances while trying to minimize the differences coming from factors external to school.
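To make the idea of a composite concrete, here is a rough sketch of how an index like ESCS could be assembled from its sub-components. This is only an illustration with toy values, NOT the official OECD procedure (which involves IRT scaling and principal-component weighting): each sub-component is standardized, then averaged.

```python
import pandas as pd

# Toy illustration (not the official ESCS computation):
# standardize each sub-component, then average them per student.
sub = pd.DataFrame({
    'HISEI':   [30, 50, 70],     # highest parental occupation (toy values)
    'PARED':   [9, 12, 16],      # parental education in years (toy values)
    'HOMEPOS': [-1.0, 0.0, 1.0], # home possessions index (toy values)
})
z = (sub - sub.mean()) / sub.std()  # z-scores per component
escs_like = z.mean(axis=1)          # simple average of the three components
print(escs_like)
```

The sign convention carries over: negative values sit below the sample average, positive values above it.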

I could also consider at least one variable accounting for the relationship a student has with school, and one for mathematics.

Attitude_towards_School:_Learning_Activities

From the Technical report, p.335:

Four items regarding attitude towards school in terms of learning activities (ATTLNACT) were included in the Student Questionnaire with four response categories from “Strongly agree” to “Strongly disagree”. [...] a) Trying hard at school will help me get a good job b) Trying hard at school will help me get into a good <college> c) I enjoy receiving good <grades> d) Trying hard at school is important

I chose this variable instead of "Learning Outcomes" because the consistency of its scale is described as high in the Technical Report, while it is only moderate-high for the other.

Cognitive_Activation_in_Mathematics_Lessons

From the Technical report, p.331:

Nine items measuring cognitive activation in mathematics lessons (COGACT) were used in the Main Survey of PISA 2012. Table 16.29 shows the item wording and the international item parameters for this scale. Response categories were “Always or almost always”, “Often”, “Sometimes” and “Never or rarely”. [Examples are:] a) The teacher asks questions that make us reflect on the problem c) The teacher asks us to decide on our own procedures for solving complex problems d) The teacher presents problems for which there is no immediately obvious method of solution f) The teacher helps us to learn from mistakes we have made

I like this variable because, psychologically (my assumption here), it should capture both the student's interest in the subject and the support the teacher gives to his/her class.

In [103]:
# Index_of_economic_social_cultural_status (ESCS): how many ESCS classes do I want to create? If I want to create any..

fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]

for i in range(6):
    plt.sca(axs[i])
    plt.hist(data=clean_df, x='Index_of_economic_social_and_cultural_status', bins=bins[i]);
    plt.title('ESCS index distribution: {} bins'.format(bins[i]))
In [104]:
ESCS0_below = (clean_df.Index_of_economic_social_and_cultural_status < 0).sum()
ESCS0_above = (clean_df.Index_of_economic_social_and_cultural_status >= 0).sum()
print('students below ESCS 0: {};\nstudents at or above ESCS 0: {}\ndifference: {}'.format(ESCS0_below, ESCS0_above, (ESCS0_below-ESCS0_above)))

ESCS_minus2_below = (clean_df.Index_of_economic_social_and_cultural_status < -2).sum()
ESCS_plus2_above = (clean_df.Index_of_economic_social_and_cultural_status > 2).sum()
print('\nstudents below ESCS -2: {};\nstudents at or above ESCS +2: {}\ndifference: {}'.format(ESCS_minus2_below, ESCS_plus2_above, (ESCS_minus2_below-ESCS_plus2_above)))

print('\nnumber of NaNs: ', clean_df.Index_of_economic_social_and_cultural_status.isna().sum())
students below ESCS 0: 157212;
students at or above ESCS 0: 137077
difference: 20135

students below ESCS -2: 19435;
students at or above ESCS +2: 1658
difference: 17777

number of NaNs:  4305

ATTENTION : this variable still has NaNs!

The ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (neither privileged nor disadvantaged), then there are 20135 more students on the disadvantaged side than on the advantaged one. And many of them are in the lower part (ESCS < -2): 17777.
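The "left-skewed" reading can also be checked numerically: the sample skewness of a left-skewed distribution is negative. A minimal sketch on synthetic data (not the actual ESCS column):

```python
import numpy as np
import pandas as pd

# Illustrative check on synthetic data (not clean_df): negating a gamma
# sample gives a left-skewed distribution, whose sample skewness is negative.
rng = np.random.default_rng(0)
sample = pd.Series(-rng.gamma(shape=2.0, scale=1.0, size=10000))
print(sample.skew())  # negative for a left-skewed distribution
```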

In [105]:
# about the NaNs: where are they?
clean_df.loc[clean_df.Index_of_economic_social_and_cultural_status.isna()]['Language_of_the_test'].value_counts()
Out[105]:
English             1253
German              1132
Spanish              666
Arabic               395
French               313
Japanese             166
Italian              117
Finnish              108
Cantonese             54
English_Arabic        36
Mandarin              23
Chinese               21
Korean                11
Shanghai dialect      10
Name: Language_of_the_test, dtype: int64

The ESCS index is missing in 4305 rows. As seen above, they mostly belong to rows whose "Language_of_the_test" is one of the two most represented languages. I will delete these rows.

In [106]:
clean_df = clean_df.loc[~(clean_df.Index_of_economic_social_and_cultural_status.isna())]
In [107]:
# Attitude_towards_School:_Learning_Activities
fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]

for i in range(6):
    plt.sca(axs[i])
    plt.hist(data=clean_df, x='Attitude_towards_School:_Learning_Activities', bins=bins[i]);
    plt.title('Attitude_towards_School:_Learning_Activities: {} bins'.format(bins[i]))
In [108]:
clean_df['Attitude_towards_School:_Learning_Activities'].nunique()
Out[108]:
76
In [109]:
# Cognitive_Activation_in_Mathematics_Lessons
fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]

for i in range(6):
    plt.sca(axs[i])
    plt.hist(data=clean_df, x='Cognitive_Activation_in_Mathematics_Lessons', bins=bins[i]);
    plt.title('Cognitive_Activation_in_Mathematics_Lessons: {} bins'.format(bins[i]))
In [110]:
# unique values, NaNs and their row "Language of test" values
print(clean_df['Cognitive_Activation_in_Mathematics_Lessons'].nunique())
print(clean_df.Cognitive_Activation_in_Mathematics_Lessons.isna().sum())
print(clean_df.loc[clean_df.Cognitive_Activation_in_Mathematics_Lessons.isna()]['Language_of_the_test'].value_counts())
1015
101499
Spanish             32177
English             25994
Italian             10035
German               7349
French               6460
Arabic               6076
Finnish              2487
Japanese             2055
Mandarin             2025
Shanghai dialect     1711
Korean               1680
Chinese              1529
Cantonese            1490
English_Arabic        431
Name: Language_of_the_test, dtype: int64
In [111]:
# the language_of_the_test values:
clean_df.Language_of_the_test.value_counts()
Out[111]:
Spanish             92476
English             76009
Italian             29725
German              21310
French              18545
Arabic              16653
Finnish              7147
Japanese             6185
Mandarin             6023
Shanghai dialect     5167
Korean               5022
Chinese              4549
Cantonese            4390
English_Arabic       1088
Name: Language_of_the_test, dtype: int64

The last two variables have a lot of NaNs, again many of them in the two bigger language-type groups, but a significant number also in the logographic group. Before deciding whether to remove those rows, I'll see what happens with the other variables.

Attitude_towards_School:_Learning_Activities has a small number of unique values (76), because the scale comes from the responses to only 4 questions; it works almost like a category. Cognitive_Activation_in_Mathematics_Lessons has more (1015), depending on 9 questions, but even with this much data, 20 bins give a good representation.
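As an alternative to eyeballing fixed bin grids like the ones above, numpy can pick a bin count with a rule of thumb: `np.histogram(..., bins='fd')` uses the Freedman-Diaconis rule, which scales bin width with the IQR and the sample size. A sketch on synthetic data (not clean_df):

```python
import numpy as np

# Let the Freedman-Diaconis rule choose the number of bins for us.
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=10000)
counts, edges = np.histogram(data, bins='fd')
print(len(counts))  # number of bins chosen by the rule
```

The same `bins='fd'` string also works in `plt.hist`, since matplotlib forwards it to numpy.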

One last thing: there are still nulls in the Plausible_level variables, in International_Language_at_Home and in Country_of_Birth_International_Self. If they are in the two main groups, I'll delete the rows.

In [112]:
# let's see where they are
clean_df.loc[clean_df.Plausible_level_1_in_math_Formulate.isna() | clean_df.Plausible_level_1_in_math_Quantity.isna()]['Language_of_the_test'].value_counts()
Out[112]:
Spanish    8997
English    4915
Name: Language_of_the_test, dtype: int64
In [113]:
# delete them
clean_df = clean_df.loc[~(clean_df.Plausible_level_1_in_math_Formulate.isna() | clean_df.Plausible_level_1_in_math_Quantity.isna())]
In [114]:
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 280377 entries, 0 to 304893
Data columns (total 45 columns):
Country                                                                      280377 non-null object
OECD_country                                                                 280377 non-null object
Student_ID                                                                   280377 non-null object
Gender                                                                       280377 non-null object
Mother_Highest_Schooling                                                     266940 non-null category
Mother_Current_Job_Status                                                    273537 non-null object
Father_Highest_Schooling                                                     258589 non-null category
Father_Current_Job_Status                                                    265305 non-null object
Country_of_Birth_International_Self                                          277597 non-null object
International_Language_at_Home                                               269721 non-null object
First_language_learned                                                       88560 non-null object
Age_started_learning_test_language                                           29588 non-null category
Language_spoken_Mother                                                       29021 non-null object
Language_spoken_Father                                                       28563 non-null object
Standard_or_simplified_set_of_booklets                                       280377 non-null object
Attitude_towards_School:_Learning_Outcomes                                   182769 non-null float64
Attitude_towards_School:_Learning_Activities                                 182258 non-null float64
Sense_of_Belonging_to_School                                                 183165 non-null float64
Mathematics_Teacher_Classroom_Management                                     182774 non-null float64
Cognitive_Activation_in_Mathematics_Lessons                                  183890 non-null float64
Index_of_economic_social_and_cultural_status                                 280377 non-null float64
Home_educational_resources                                                   278947 non-null float64
Highest_educational_level_of_parents                                         276497 non-null category
Instrumental_Motivation_for_Mathematics                                      185106 non-null float64
Mathematics_Interest                                                         185267 non-null float64
Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends    29246 non-null category
Language_at_home                                                             280304 non-null object
Preference_for_Heritage_Language_in_Language_Reception_and_Production        28529 non-null category
Mathematics_Work_Ethic                                                       184083 non-null float64
Mathematics_Teacher_Support                                                  183287 non-null float64
Mathematics_Self_Concept                                                     183946 non-null float64
Teacher_Student_Relations                                                    183404 non-null float64
Subjective_Norms_in_Mathematics                                              185026 non-null float64
Language_of_the_test                                                         280377 non-null object
Plausible_level_1_in_mathematics                                             280377 non-null category
Plausible_level_1_in_math_Change_and_Relationships                           280377 non-null category
Plausible_level_1_in_math_Quantity                                           280377 non-null category
Plausible_level_1_in_math_Space_and_Shape                                    280377 non-null category
Plausible_level_1_in_math_Uncertainty_and_Data                               280377 non-null category
Plausible_level_1_in_math_Employ                                             280377 non-null category
Plausible_level_1_in_math_Formulate                                          280377 non-null category
Plausible_level_1_in_math_Interpret                                          280377 non-null category
Plausible_level_1_in_reading                                                 280377 non-null category
Plausible_level_1_in_science                                                 280377 non-null category
Language_type                                                                280377 non-null object
dtypes: category(16), float64(14), object(15)
memory usage: 68.5+ MB
In [115]:
# Country_of_Birth_International_Self and International_Language_at_Home NaNs : where
clean_df.loc[clean_df.Country_of_Birth_International_Self.isna() | clean_df.International_Language_at_Home.isna()]['Language_of_the_test'].value_counts()
Out[115]:
English             2861
Italian             2768
Spanish             2366
German              1713
French              1132
Arabic               798
Mandarin             424
Chinese              249
English_Arabic       105
Japanese             104
Cantonese             98
Finnish               87
Korean                68
Shanghai dialect      59
Name: Language_of_the_test, dtype: int64
In [116]:
# delete
clean_df = clean_df.loc[~(clean_df.Country_of_Birth_International_Self.isna() | clean_df.International_Language_at_Home.isna())]
In [117]:
# subset with the variables I'll keep (reordered as in the summary below)
exploration_df = clean_df[['Country', 'Student_ID', 'Gender', 'Index_of_economic_social_and_cultural_status', 'Country_of_Birth_International_Self', 'Language_at_home', 'Language_of_the_test', 'Language_type', 'International_Language_at_Home', 'Standard_or_simplified_set_of_booklets', 'Attitude_towards_School:_Learning_Activities', 'Cognitive_Activation_in_Mathematics_Lessons', 'Plausible_level_1_in_mathematics', 'Plausible_level_1_in_math_Change_and_Relationships', 'Plausible_level_1_in_math_Quantity', 'Plausible_level_1_in_math_Space_and_Shape', 'Plausible_level_1_in_math_Uncertainty_and_Data', 'Plausible_level_1_in_math_Employ', 'Plausible_level_1_in_math_Formulate', 'Plausible_level_1_in_math_Interpret', 'Plausible_level_1_in_science', 'Plausible_level_1_in_reading']].copy()

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Summary for the variables I'll keep:

  • Country : the most represented are Mexico and Italy (more than 30k students each), followed by Spain and Canada (a little above 20k each). Between 15k and 10k there are Australia, the UK, the United Arab Emirates, Switzerland and Qatar (and probably the US, putting together the US, Connecticut, Massachusetts and Florida). All other Countries are under 10k;
  • Student_ID
  • Gender : about 50% females, 50% males
  • Index_of_economic_social_and_cultural_status : the ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (neither privileged nor disadvantaged), then there are more students on the disadvantaged side than on the advantaged one, and many of them are in the lower part (ESCS < -2). Values updated by rerunning the cell after the last cleaning:

    students below ESCS 0: 140554, students at or above ESCS 0: 126991, difference: 13563

    students below ESCS -2: 16582, students at or above ESCS +2: 1463, difference: 15119

  • Country_of_Birth_International_Self : in most cases it is the same as the country where the students were tested;
  • Language_at_home : there are too many values and in many cases I'm not interested in them. Maybe it is possible to create broader groups (not sure it is needed, though);
  • Language_of_the_test : Spanish is the most represented, followed by English. I need to regroup them into my 3 language-type categories, but it is already apparent that the logographic type will account for far fewer rows than each of the others;
  • Language_type : my new variable: three categories "shallow_ortography" (Spanish, Finnish, Italian, German), "deep_ortography" (English, French, Arabic, English_Arabic), "logographic" (Chinese, Japanese, Korean, Shanghai dialect, Mandarin, Cantonese)
  • International_Language_at_Home : in most cases it is the same as the test language;
  • Standard_or_simplified_set_of_booklets : standard is the majority, but there are still around 75k simplified ones;
  • Attitude_towards_School:_Learning_Activities HAS NaNs!* : this variable has 76 values; it works almost like a category. It is good to keep it numerical because the distances between the values are not regular. It spans from -3.38 to +1.21;
  • Cognitive_Activation_in_Mathematics_Lessons HAS NaNs!* : this variable has 1015 values, spanning from -3.88 to +3.20;
  • Plausible_level_1_in_mathematics and subscales

    • Plausible_level_1_in_math_Change_and_Relationships
    • Plausible_level_1_in_math_Quantity
    • Plausible_level_1_in_math_Space_and_Shape
    • Plausible_level_1_in_math_Uncertainty_and_Data
    • Plausible_level_1_in_math_Employ
    • Plausible_level_1_in_math_Formulate
    • Plausible_level_1_in_math_Interpret

    The distribution of students across the math scale and subscales is unimodal and slightly right-skewed, with the majority of students at levels 2 and 3, followed by 1 and below_1, and lastly 4, 5, 6.

    The distribution of the Uncertainty_and_Data subscale is almost identical to that of the main mathematical scale. The other subscales vary to different degrees.

    It makes sense that there are not many students at level "6", since, at 15, math programs are certainly not yet complete. Hopefully the "Too low" ones are due to problems with the language (maybe students who moved from a different language area);

  • Plausible_level_1_in_science : the distribution looks more "normal", but the "Too low" column count is significant, so I would say it is still slightly right-skewed, though less so than the main math scale. As with mathematics, it makes sense that there are not many students at level "6", since, at 15, science programs are certainly not yet complete. And again, hopefully the "Too low" ones are due to problems with the language;

  • Plausible_level_1_in_reading : this distribution looks normal. If people have access to schooling of any type, reading is surely one of the key subjects, and the skill is then applied to the study of almost all other topics.

* Attitude_towards_School:_Learning_Activities and Cognitive_Activation_in_Mathematics_Lessons have a lot of NaNs, many of them in the two bigger language-type groups, but a significant number also in the logographic group. Before deciding whether to remove those rows, I'll see what happens with the other variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I considered using the variables about the parents' highest level of education (mother, father and both): the number of values in each column was similar, and the last one was theoretically derived from the first two, but it was not possible (the values were not comparable; details above).

It turned out that the ESCS variable is built on top of many variables, including the educational level of parents. Moreover, it has been weighted to be comparable among Countries, and is therefore a much better choice.


The Plausible_levels variables I am going to use are derived from the original Plausible_values given in the PISA2012 dataset, using the transformation bands described in the Technical Reports 2012, 2019 and 2016. I changed the values for interpretability: from the raw values you could only tell that a higher score corresponds to better performance in the task, while the proficiency level offers a description of the specific abilities supposedly mastered by the student.
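The score-to-level transformation can be sketched with pd.cut. The cut points below are the PISA 2012 mathematics proficiency-level boundaries as I read them from the Technical Report; treat them as approximate here, and the plausible values are toy numbers:

```python
import pandas as pd

# Sketch: bin plausible values into proficiency levels with pd.cut.
# Boundaries are the PISA 2012 mathematics bands (approximate).
bounds = [-float('inf'), 357.77, 420.07, 482.38, 544.68, 606.99, 669.30, float('inf')]
labels = ['below_1', '1', '2', '3', '4', '5', '6']
pv = pd.Series([340.0, 500.0, 700.0])           # toy plausible values
levels = pd.cut(pv, bins=bounds, labels=labels)  # ordered categorical result
print(list(levels))
```

pd.cut returns an ordered categorical, which matches the `category` dtype these columns have in the dataframe.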


The Language_type is a variable I introduced, to regroup the rows into one of these 3 categories: The language the test was administered in is

  • alphabetic, with a highly phonemic orthography (or shallow orthography)
  • alphabetic, with a more complex relationship between orthography and pronunciation, a deeper orthography
  • logographic

The languages I chose to select from the dataset are:

  • "shallow_ortography" (Spanish, Finnish, Italian, German)
  • "deep_ortography" (English, French, Arabic, English_Arabic)
  • "logographic" (Chinese, Japanese, Korean, Shanghai dialect, Mandarin, Cantonese)

Unfortunately the last group contains far less data than the other two. On the bright side, even that group is not that small, with 30334 rows, and it is composed of many languages and different schooling systems, like the other two groups (see the cell below for its composition). I might decide to reduce the subgroups of Spanish, Italian and English where the data come from one country and have more than 7k instances.


Lastly, I dropped a few thousand rows with missing values in the variables I am interested in. Two variables still present NaNs, but dropping those rows would delete up to 1/3 of the data for subgroups that are not too large (e.g. 2055 rows of 6081 for Japanese, 2025 of 5599 for Mandarin). I will wait to see some graphs before deciding whether to drop those.
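The per-subgroup impact of dropping those remaining NaNs can be checked directly; a toy sketch (notebook-style column names assumed):

```python
import numpy as np
import pandas as pd

# Sketch: fraction of rows per test language that dropping NaNs would delete.
df = pd.DataFrame({
    'Language_of_the_test': ['Japanese'] * 4 + ['Finnish'] * 2,
    'Cognitive_Activation_in_Mathematics_Lessons': [1.0, np.nan, np.nan, 2.0, 1.5, 0.5],
})
lost = (df['Cognitive_Activation_in_Mathematics_Lessons']
        .isna()
        .groupby(df['Language_of_the_test'])
        .mean())
print(lost.to_dict())  # {'Finnish': 0.0, 'Japanese': 0.5}
```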

In [118]:
# Language_type by Language_of_the_test and Country
exploration_df.groupby(['Language_type','Language_of_the_test'])['Country'].value_counts()
Out[118]:
Language_type        Language_of_the_test  Country             
deep orthography     Arabic                United Arab Emirates     5963
                                           Qatar                    5593
                                           Tunisia                  4299
                     English               Canada                  15133
                                           Australia               13372
                                           United Kingdom          11706
                                           Singapore                5096
                                           United Arab Emirates     4874
                                           Ireland                  4764
                                           New Zealand              4079
                                           Qatar                    3236
                                           Florida (USA)            1814
                                           Massachusetts (USA)      1655
                                           Connecticut (USA)        1632
                                           Macao-China               680
                                           Luxembourg                130
                                           Hong Kong-China            62
                     English_Arabic        Qatar                     983
                     French                Canada                   5138
                                           France                   4309
                                           Switzerland              4201
                                           Belgium                  2627
                                           Luxembourg               1138
logographic          Cantonese             Hong Kong-China          4292
                     Chinese               Macao-China              4300
                     Japanese              Japan                    6081
                     Korean                Korea                    4954
                     Mandarin              Chinese Taipei           5599
                     Shanghai dialect      China-Shanghai           5108
shallow orthography  Finnish               Finland                  7060
                     German                Switzerland              5781
                                           Austria                  4319
                                           Germany                  3879
                                           Luxembourg               3220
                                           Italy                    1481
                                           Belgium                   651
                                           Liechtenstein             266
                     Italian               Italy                   26519
                                           Switzerland               438
                     Spanish               Mexico                  32663
                                           Spain                   20880
                                           Chile                    6686
                                           Peru                     5698
                                           Argentina                5639
                                           Uruguay                  5090
                                           Costa Rica               4457
Name: Country, dtype: int64

Bivariate Exploration

ATTENTION: I tried a PairGrid to have a quick look at some of the variables, and since I was just waiting for a good excuse to reduce the data, here I go.

I'll first delete the rows with NaNs, then have a look at the proportions and decide how to further trim the data.

Further trimming and check the univariate distributions

In [119]:
exploration_df = exploration_df.loc[~(exploration_df.Cognitive_Activation_in_Mathematics_Lessons.isna() | exploration_df['Attitude_towards_School:_Learning_Activities'].isna())]
In [120]:
exploration_df.Language_type.value_counts()
Out[120]:
shallow orthography    87151
deep orthography       66002
logographic            20125
Name: Language_type, dtype: int64
In [121]:
exploration_df.groupby(['Language_type', 'Language_of_the_test'])['Country'].value_counts()
Out[121]:
Language_type        Language_of_the_test  Country             
deep orthography     Arabic                United Arab Emirates     3789
                                           Qatar                    3324
                                           Tunisia                  2667
                     English               Canada                   9975
                                           Australia                8688
                                           United Kingdom           7696
                                           Singapore                3357
                                           Ireland                  3159
                                           United Arab Emirates     3069
                                           New Zealand              2644
                                           Qatar                    2011
                                           Florida (USA)            1172
                                           Massachusetts (USA)      1065
                                           Connecticut (USA)        1025
                                           Macao-China               453
                                           Luxembourg                 86
                                           Hong Kong-China            42
                     English_Arabic        Qatar                     551
                     French                Canada                   3326
                                           France                   2810
                                           Switzerland              2741
                                           Belgium                  1646
                                           Luxembourg                706
logographic          Cantonese             Hong Kong-China          2817
                     Chinese               Macao-China              2857
                     Japanese              Japan                    4043
                     Korean                Korea                    3289
                     Mandarin              Chinese Taipei           3699
                     Shanghai dialect      China-Shanghai           3420
shallow orthography  Finnish               Finland                  4549
                     German                Switzerland              3807
                                           Austria                  2834
                                           Germany                  2463
                                           Luxembourg               2097
                                           Italy                     966
                                           Belgium                   420
                                           Liechtenstein             168
                     Italian               Italy                   17404
                                           Switzerland               288
                     Spanish               Mexico                  21419
                                           Spain                   13673
                                           Chile                    4418
                                           Argentina                3502
                                           Peru                     3284
                                           Uruguay                  3086
                                           Costa Rica               2773
Name: Country, dtype: int64

I will subsample the Italian, Spanish, English, French, German and Arabic rows (dropping a fraction from each country)

In [122]:
fractions_to_drop = {'Italian': {'Italy':.85},
                     
                     'Spanish': {'Mexico':.95, 'Spain':.9, 'Chile':.75, 'Argentina':.65,
                                'Uruguay':.6, 'Peru':.65, 'Costa Rica':.6},
                     
                     'English': {'Canada':.9, 'Australia':.9, 'United Kingdom':.9,
                                'New Zealand':.7, 'Ireland':.75, 'Singapore':.8,
                                 'United Arab Emirates':.8, 'Qatar':.65, 'Florida (USA)':.25,
                                'Massachusetts (USA)':.25, 'Connecticut (USA)':.25},
                     
                     'French': {'Canada':.5, 'France':.5, 'Switzerland':.4, 'Belgium':.35},
                     
                     'German': {'Switzerland':.5, 'Germany':.4, 'Luxembourg':.35, 'Austria':.35},
                     
                     'Arabic': {'United Arab Emirates':.4, 'Qatar':.35, 'Tunisia':.4}
                    }



for lang in fractions_to_drop:
    for country, fraction in fractions_to_drop[lang].items():
        exploration_df = exploration_df.drop(exploration_df.loc[(exploration_df.Country == country) & (exploration_df.Language_of_the_test == lang)].sample(frac=fraction).index)
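Note that sample is called without a seed here, so rerunning the notebook produces a different subsample; a seeded variant would make the trimming reproducible (toy data, hypothetical seed):

```python
import pandas as pd

# Sketch: the same drop pattern with random_state for reproducibility.
# `frac` is the fraction of matching rows to DROP.
df = pd.DataFrame({'Country': ['Italy'] * 10,
                   'Language_of_the_test': ['Italian'] * 10})
mask = (df.Country == 'Italy') & (df.Language_of_the_test == 'Italian')
df = df.drop(df.loc[mask].sample(frac=0.8, random_state=42).index)
print(len(df))  # 2 rows remain after dropping 80% of 10
```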
        
    
In [123]:
exploration_df.Language_type.value_counts()
Out[123]:
shallow orthography    23848
deep orthography       22310
logographic            20125
Name: Language_type, dtype: int64
In [124]:
exploration_df.groupby(['Language_type'])['Language_of_the_test'].value_counts()
Out[124]:
Language_type        Language_of_the_test
deep orthography     English                 9236
                     French                  6489
                     Arabic                  6034
                     English_Arabic           551
logographic          Japanese                4043
                     Mandarin                3699
                     Shanghai dialect        3420
                     Korean                  3289
                     Chinese                 2857
                     Cantonese               2817
shallow orthography  Spanish                 8260
                     German                  8140
                     Finnish                 4549
                     Italian                 2899
Name: Language_of_the_test, dtype: int64
In [125]:
# a quick look at the univariate distribution of the variables I want to focus on

plt.figure(figsize=(20,4))

plt.subplot(1,4,1)
plt.hist(data=exploration_df, x='Index_of_economic_social_and_cultural_status', bins=40, color=sb.color_palette()[4]);
plt.title('ESCS index distribution: {} bins'.format('40'))

plt.subplot(1,4,2)
sb.countplot(exploration_df.Plausible_level_1_in_mathematics, color=sb.color_palette()[0])

plt.subplot(1,4,3)
sb.countplot(exploration_df.Plausible_level_1_in_science, color=sb.color_palette()[3])

plt.subplot(1,4,4)
sb.countplot(exploration_df.Plausible_level_1_in_reading, color=sb.color_palette()[2])
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e39c83408>
In [126]:
# continue..
plt.figure(figsize=(20,4))

plt.subplot(1,4,1)
sb.countplot(exploration_df.Gender)
plt.subplot(1,4,2)
sb.countplot(exploration_df.International_Language_at_Home)
plt.subplot(1,4,3)
sb.countplot(exploration_df.Country_of_Birth_International_Self)
plt.subplot(1,4,4)
sb.countplot(exploration_df.Standard_or_simplified_set_of_booklets)
Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e4627dc88>
In [127]:
# continue..
plt.figure(figsize=(20,4))

plt.subplot(1,2,1)
plt.hist(exploration_df['Attitude_towards_School:_Learning_Activities'], bins=10)
plt.subplot(1,2,2)
plt.hist(exploration_df.Cognitive_Activation_in_Mathematics_Lessons, bins=20);

All these distributions look similar enough to those of the clean_df. Move on!

Bivariate exploration - start again!

After further trimming the data, we can move on! :)

In [128]:
exploration_df.columns
Out[128]:
Index(['Country', 'Student_ID', 'Gender',
       'Index_of_economic_social_and_cultural_status',
       'Country_of_Birth_International_Self', 'Language_at_home',
       'Language_of_the_test', 'Language_type',
       'International_Language_at_Home',
       'Standard_or_simplified_set_of_booklets',
       'Attitude_towards_School:_Learning_Activities',
       'Cognitive_Activation_in_Mathematics_Lessons',
       'Plausible_level_1_in_mathematics',
       'Plausible_level_1_in_math_Change_and_Relationships',
       'Plausible_level_1_in_math_Quantity',
       'Plausible_level_1_in_math_Space_and_Shape',
       'Plausible_level_1_in_math_Uncertainty_and_Data',
       'Plausible_level_1_in_math_Employ',
       'Plausible_level_1_in_math_Formulate',
       'Plausible_level_1_in_math_Interpret', 'Plausible_level_1_in_science',
       'Plausible_level_1_in_reading'],
      dtype='object')
In [129]:
# let's first look at a possible correlation between the math scale and subscales and the reading scale
math_vars = ['Plausible_level_1_in_mathematics',
               'Plausible_level_1_in_math_Change_and_Relationships',
               'Plausible_level_1_in_math_Quantity',
               'Plausible_level_1_in_math_Space_and_Shape',
               'Plausible_level_1_in_math_Uncertainty_and_Data',
               'Plausible_level_1_in_math_Employ',
               'Plausible_level_1_in_math_Formulate',
               'Plausible_level_1_in_math_Interpret']


fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()

for i in range(8):
    plt.sca(axs[i])
    sb.countplot(data=exploration_df, x=math_vars[i], hue='Plausible_level_1_in_reading', palette='mako_r');
    

There is a clear positive correlation between the test results in reading and in mathematics.

The math subscales look very similar to the main scale, so I'll keep only the main one.
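The visual impression of a positive association can be backed with a quick rank correlation on the ordered level codes; a toy sketch (level labels assumed):

```python
import pandas as pd

# Sketch: Spearman correlation between two ordered proficiency-level columns,
# computed on the categorical codes (toy data).
levels = ['Below 1', 'Level 1', 'Level 2']
dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)
df = pd.DataFrame({
    'math': pd.Series(['Below 1', 'Level 1', 'Level 2', 'Level 2']).astype(dtype),
    'reading': pd.Series(['Below 1', 'Level 2', 'Level 1', 'Level 2']).astype(dtype),
})
rho = df['math'].cat.codes.corr(df['reading'].cat.codes, method='spearman')
print(round(rho, 2))  # 0.5
```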

In [130]:
# math level by ESCS
g = sb.FacetGrid(data=exploration_df, col='Plausible_level_1_in_mathematics', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
In [131]:
# math level by ESCS again
sb.violinplot(data=exploration_df, x='Plausible_level_1_in_mathematics', y='Index_of_economic_social_and_cultural_status', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df, x='Plausible_level_1_in_mathematics', y='Index_of_economic_social_and_cultural_status', color='white');

There seems to be a correlation here as well, and it is a bit sad: very low levels in mathematics can be "achieved" by students from every socioeconomic and cultural background; on the other side, students living in the most disadvantaged conditions are not present in the top mathematics levels.

In [132]:
# reading level by ESCS
sb.violinplot(data=exploration_df, x='Plausible_level_1_in_reading', y='Index_of_economic_social_and_cultural_status', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df, x='Plausible_level_1_in_reading', y='Index_of_economic_social_and_cultural_status', color='white');

The relationship between reading and ESCS is similar to math and ESCS, maybe a little less clean in the lowest reading levels and more defined in the upper ones.

In [133]:
# Math levels and Gender
fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()

for i in range(8):
    plt.sca(axs[i])
    sb.countplot(data=exploration_df, x=math_vars[i], hue='Gender');
In [134]:
# Math level by gender (normalized) 
math_gender_norm = exploration_df.groupby('Gender')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_gender_norm = math_gender_norm.mul(100)
math_gender_norm = math_gender_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
math_gender_norm.Plausible_level_1_in_mathematics = math_gender_norm.Plausible_level_1_in_mathematics.astype(ordered_math)

g = sb.catplot(data=math_gender_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Gender',
               hue_order=['Male', 'Female'], kind='bar')

Girls generally seem to score a bit worse: they outnumber boys in the lower categories and are fewer in the 3 top categories.
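The ordered_math dtype used in the cell above is defined earlier in the notebook; for reference, a sketch of how such an ordered dtype can be built (level labels assumed):

```python
import pandas as pd

# Sketch: an ordered CategoricalDtype like the notebook's ordered_math,
# so levels sort and compare in proficiency order, not alphabetically.
ordered_math = pd.api.types.CategoricalDtype(
    categories=['Below 1', 'Level 1', 'Level 2', 'Level 3',
                'Level 4', 'Level 5', 'Level 6'],
    ordered=True)

s = pd.Series(['Level 3', 'Below 1']).astype(ordered_math)
print(s.min())  # Below 1
```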

In [135]:
# Reading level by gender (normalized) 
read_gender_norm = exploration_df.groupby('Gender')['Plausible_level_1_in_reading'].value_counts(normalize=True)
read_gender_norm = read_gender_norm.mul(100)
read_gender_norm = read_gender_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
read_gender_norm.Plausible_level_1_in_reading = read_gender_norm.Plausible_level_1_in_reading.astype(ordered_reading)

g = sb.catplot(data=read_gender_norm, x='Plausible_level_1_in_reading',y='percent',hue='Gender',
               hue_order=['Male', 'Female'], kind='point')

Contrary to math levels by gender, reading levels by gender suggest that girls do better than boys in this task.

In [136]:
# the one I am really interested in: Math level by language type
fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()

for i in range(8):
    plt.sca(axs[i])
    sb.countplot(data=exploration_df, x=math_vars[i], hue='Language_type', palette='icefire');
    #I know it is a diverging palette, but with 3 categories it just is clearer and more pleasant than all the qualitative ones

There appears to be a difference in performance among the 3 linguistic groups, but before trying to describe it, let's normalize the counts.

In [137]:
# distribution of math scores by language type (normalized)
math_language_norm = exploration_df.groupby('Language_type')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_language_norm = math_language_norm.mul(100)
math_language_norm = math_language_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
math_language_norm.Plausible_level_1_in_mathematics = math_language_norm.Plausible_level_1_in_mathematics.astype(ordered_math)

g = sb.catplot(data=math_language_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
               hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')

After normalization (the percentages within each group are now shown) the plot changed very little.

The students using a logographic language perform better: their distribution is clearly left-skewed. Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories.
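For clarity, the normalization chain used in these cells (groupby + value_counts(normalize=True), scaled to percentages) on a toy frame:

```python
import pandas as pd

# Sketch: within-group percentages; each group's percentages sum to 100.
df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b'],
                   'level': ['hi', 'hi', 'lo', 'hi']})
pct = (df.groupby('grp')['level']
         .value_counts(normalize=True)
         .mul(100)
         .rename('percent')
         .reset_index())
print(pct)
```

The rename before reset_index matters: it avoids a name clash between the counted column and the resulting value column.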

In [138]:
# math scores vs international language at home (normalized)
math_int_lang_norm = exploration_df.groupby('International_Language_at_Home')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_int_lang_norm = math_int_lang_norm.mul(100)
math_int_lang_norm = math_int_lang_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
math_int_lang_norm.Plausible_level_1_in_mathematics = math_int_lang_norm.Plausible_level_1_in_mathematics.astype(ordered_math)

sb.catplot(data=math_int_lang_norm, x='Plausible_level_1_in_mathematics', y='percent', hue='International_Language_at_Home', 
           kind='bar')
Out[138]:
<seaborn.axisgrid.FacetGrid at 0x23daaf67388>
In [139]:
# reading score vs international language at home (normalized)
# just percentages, connected dot, so that it is easier to compare them
read_int_lang_norm = exploration_df.groupby('International_Language_at_Home')['Plausible_level_1_in_reading'].value_counts(normalize=True)
read_int_lang_norm = read_int_lang_norm.mul(100)
read_int_lang_norm = read_int_lang_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
read_int_lang_norm.Plausible_level_1_in_reading = read_int_lang_norm.Plausible_level_1_in_reading.astype(ordered_reading)

sb.catplot(data=read_int_lang_norm, x='Plausible_level_1_in_reading', y='percent', hue='International_Language_at_Home', 
           kind='point');

It looks like students whose international language at home is different from the language they took the test in are at a disadvantage. I'll drop the "other language" rows and have a second look at the relation between math scores and language type.

In [140]:
# see how many rows I'm talking about
exploration_df.groupby('Language_type').International_Language_at_Home.value_counts()
Out[140]:
Language_type        International_Language_at_Home
deep orthography     Language of the test              17444
                     Other language                     4866
logographic          Language of the test              19290
                     Other language                      835
shallow orthography  Language of the test              19480
                     Other language                     4368
Name: International_Language_at_Home, dtype: int64
In [141]:
#subset where international language at home is the same of the test
home_test_lang_df = exploration_df[exploration_df.International_Language_at_Home=='Language of the test'].copy()  # .copy() avoids SettingWithCopyWarning later
In [142]:
# distribution of math scores by language type (normalized) - language at home consistent with language of the test
# bars and connected point
math_language_norm2 = home_test_lang_df.groupby('Language_type')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_language_norm2 = math_language_norm2.mul(100)
math_language_norm2 = math_language_norm2.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
math_language_norm2.Plausible_level_1_in_mathematics = math_language_norm2.Plausible_level_1_in_mathematics.astype(ordered_math)

sb.catplot(data=math_language_norm2, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
               hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')
sb.catplot(data=math_language_norm2, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
               hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='point', palette='icefire')
Out[142]:
<seaborn.axisgrid.FacetGrid at 0x23e33f4d6c8>
In [143]:
# let's see the difference in percentages
difference_without_intern_lang_rows = math_language_norm.copy()
difference_without_intern_lang_rows.percent = difference_without_intern_lang_rows.percent - math_language_norm2.percent

sb.catplot(data=difference_without_intern_lang_rows, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
               hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='point', linestyles='', palette='icefire')
Out[143]:
<seaborn.axisgrid.FacetGrid at 0x23e326a1708>
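One caveat: the subtraction above aligns the two frames by row position, which is only safe if both list the (language type, level) pairs in the same order; since value_counts sorts by frequency within groups, merging on the keys is the order-independent pattern (toy sketch):

```python
import pandas as pd

# Sketch: difference of within-group percentages via a key-based merge,
# immune to row-order differences between the two frames.
a = pd.DataFrame({'grp': ['x', 'y'], 'level': ['L1', 'L1'], 'percent': [60.0, 40.0]})
b = pd.DataFrame({'grp': ['y', 'x'], 'level': ['L1', 'L1'], 'percent': [35.0, 55.0]})

diff = a.merge(b, on=['grp', 'level'], suffixes=('_all', '_subset'))
diff['percent'] = diff['percent_all'] - diff['percent_subset']
print(diff[['grp', 'level', 'percent']])
```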

The distribution doesn't change much without the rows where the language at home differs from the language of the test: the percentages change within a range from -0.3 to +0.35 points.

So, again, the students using a logographic language perform better: their distribution is clearly left-skewed. Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories. The lowest category, "Below 1", is where the two alphabetic groups differ most, and the deep orthography group seems to do the worse of the two.

There may be many different reasons:

  • the schooling system is the first that comes to mind;
  • the writing system, being that different, could be a good candidate;
  • since the difference mostly separates the East Asian students from the others, it could even be that these students are better at math for other, population-specific reasons.

There is one area, Macao-China, for which the PISA 2012 dataset records a large number of tests administered in Chinese and a smaller number in English.

In [144]:
print(clean_df[clean_df.Country=='Macao-China'].Language_of_the_test.value_counts())
print('\n')
print(clean_df[clean_df.Country=='Macao-China'].International_Language_at_Home.value_counts())
Chinese    4300
English     680
Name: Language_of_the_test, dtype: int64


Language of the test    4304
Other language           676
Name: International_Language_at_Home, dtype: int64
In [145]:
within_macao = clean_df[clean_df.Country=='Macao-China'].copy()
In [146]:
within_macao2 = within_macao[within_macao.International_Language_at_Home=='Language of the test']
# Chinese    4253
# English      51

within_macao2.groupby('Language_of_the_test').Language_at_home.value_counts()
Out[146]:
Language_of_the_test  Language_at_home                   
Chinese               Cantonese                              4056
                      Chinese dialects or languages (MAC)     127
                      Mandarin                                 70
English               English                                  51
Name: Language_at_home, dtype: int64
In [147]:
# math score by language at home
macao2 = within_macao2.groupby('Language_at_home')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
macao2 = macao2.mul(100)
macao2 = macao2.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
macao2.Plausible_level_1_in_mathematics = macao2.Plausible_level_1_in_mathematics.astype(ordered_math)

sb.catplot(data=macao2, x='Plausible_level_1_in_mathematics', y='percent', hue='Language_at_home', hue_order=['English', 'Cantonese', 'Mandarin', 'Chinese dialects or languages (MAC)'], 
           kind='point', palette='bright');

The number of tests in English is not large (51); however, the Mandarin-speaking students here are not many more (70), and the distribution of their scores follows that of the much larger Cantonese sample (4056 students).

If anything, since the distribution of math scores for the students tested in English here differs from that of the larger English-speaking dataset, we can think that language per se is not the barrier.

NOTE: unfortunately the school system in Macao-China "does not have a single centralised set of standards or curriculum. Individual schools follow different educational models, including Chinese, Portuguese, Hong Kong, and British systems." Nonetheless, "the majority of the schools in Macau are grammar schools, which offer language learning, mathematics, science subjects, social studies, etc. to the pupils", so it is reasonable to think that these data do not come from vocational school students (vocational schools there start after age 15 anyhow).

In [148]:
# 
sb.regplot(data=home_test_lang_df, x='Attitude_towards_School:_Learning_Activities',
            y='Cognitive_Activation_in_Mathematics_Lessons', scatter_kws={'alpha':.1})
Out[148]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e386d9dc8>
In [149]:
sb.regplot(data=home_test_lang_df, x='Attitude_towards_School:_Learning_Activities',
            y='Index_of_economic_social_and_cultural_status', scatter_kws={'alpha':.1})
Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e42d84e08>
In [150]:
sb.regplot(data=home_test_lang_df, x='Cognitive_Activation_in_Mathematics_Lessons',
            y='Index_of_economic_social_and_cultural_status', scatter_kws={'alpha':.01})
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d7e223f08>
In [151]:
# distribution of math scores by type of booklet (normalized) - language at home consistent with language of the test
# bars and connected point
math_booklet_norm = home_test_lang_df.groupby('Standard_or_simplified_set_of_booklets')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_booklet_norm = math_booklet_norm.mul(100)
math_booklet_norm = math_booklet_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_booklet_norm.Plausible_level_1_in_mathematics = math_booklet_norm.Plausible_level_1_in_mathematics.astype(ordered_math)

sb.catplot(data=math_booklet_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Standard_or_simplified_set_of_booklets',
               kind='bar')
sb.catplot(data=math_booklet_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Standard_or_simplified_set_of_booklets',
               kind='point')
Out[151]:
<seaborn.axisgrid.FacetGrid at 0x23e20471c08>

ATTENTION!!! The distribution of math scores of the students who solved the easier set of booklets looks a lot like the distribution of the scores of the two alphabetic language groups; the distribution of math scores of the students who solved the standard set of booklets looks a lot like that of the logographic language group.

In [152]:
# distribution of booklet type by language type (normalized) - language at home consistent with language of the test
# bars and connected point
langtype_booklet_norm = home_test_lang_df.groupby('Standard_or_simplified_set_of_booklets')['Language_type'].value_counts(normalize=True)
langtype_booklet_norm = langtype_booklet_norm.mul(100)
langtype_booklet_norm = langtype_booklet_norm.rename('percent').reset_index()

sb.catplot(data=langtype_booklet_norm, x='Language_type',y='percent',hue='Standard_or_simplified_set_of_booklets',
               kind='bar', palette='muted');
In [153]:
# percentage of booklet_types in each language group

#cat1
lang_type_order = ['shallow orthography', 'deep orthography', 'logographic']
#cat2
book_type_order = home_test_lang_df.Standard_or_simplified_set_of_booklets.unique().tolist()

# turn the variables into categories to avoid problems with zero counts in the graph
booklet_category = pd.api.types.CategoricalDtype(ordered=False, categories=book_type_order)
home_test_lang_df.Standard_or_simplified_set_of_booklets = home_test_lang_df.Standard_or_simplified_set_of_booklets.astype(booklet_category)

langtype_category = pd.api.types.CategoricalDtype(ordered=True, categories=lang_type_order)
home_test_lang_df.Language_type = home_test_lang_df.Language_type.astype(langtype_category)

artists = [] # store reference to plot elements
baselines = np.zeros(len(lang_type_order))

lang_type_counts = home_test_lang_df.Language_type.value_counts()

# for each cat2 value
for i in range(len(book_type_order)):
    book_type = book_type_order[i]
    # get proportions of the cat1 values
    inner_counts = home_test_lang_df[home_test_lang_df.Standard_or_simplified_set_of_booklets==book_type]['Language_type'].value_counts()
    inner_props = inner_counts/lang_type_counts
    
    # and plot them on top of previous ones
    bars = plt.bar(x=np.arange(len(lang_type_order)), height=inner_props[lang_type_order], bottom=baselines)
    artists.append(bars)
    baselines += inner_props[lang_type_order]
    
plt.xticks(np.arange(len(lang_type_order)), lang_type_order)
plt.legend(reversed(artists), reversed(book_type_order), framealpha=1, loc=6)
Out[153]:
<matplotlib.legend.Legend at 0x23e338e6f48>

The math scores differ widely between students who completed the standard and the simplified set of booklets, and the simplified set is present only in the two alphabetic language groups.

In [154]:
#home_test_lang_df[home_test_lang_df.Standard_or_simplified_set_of_booklets=='Standard set of booklets'].Language_type.value_counts()

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Math levels vs reading levels: there is a clear positive correlation between the test results in reading and in mathematics.

  • Math levels by ESCS: there seems to be a correlation here as well, and it is a bit sad: very low levels in mathematics can be "achieved" by students from every socioeconomic and cultural background; on the other side, students living in the most disadvantaged conditions are not present in the top mathematics levels.

  • Math levels by gender (normalized): girls seem to score a bit worse, generally: they outnumber boys in the lower categories and are fewer in the 3 top categories.

  • math scores vs international language at home (normalized) and reading scores vs international language at home (normalized): there appears to be a disadvantage for students whose international language at home differs from the language they took the test in. So I dropped the "other language" rows and had a second look at the relation between math scores and language type.

  • Distribution of math scores by language type (normalized) - language at home consistent with language of the test: students using a logographic language perform better; their distribution is clearly left-skewed.

    Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories. The lowest category, "Below 1", is where the two alphabetic groups differ most, and the deep orthography group seems to do worse of the two.

    There may be many different reasons:

    • the schooling system is the first that comes to mind;
    • the writing system, being so different, could be a good candidate;
    • since the difference mostly separates East Asian populations from the others, it could even be that they are better at math for some other reason.

Since I have an area, Macao-China, for which the PISA 2012 dataset records many tests administered in Chinese and a smaller number in English, I had a look at the distribution of the math scores there. The number of tests in English is not large (51), but the Mandarin-speaking students here are not much more numerous (70), and the distribution of their scores follows that of the much larger Cantonese sample (4056 students).

If anything, since the distribution of math scores for the English-test students differs from that of the larger dataset, we can think that language is not a barrier per se.
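The Macao-China comparison above can be sketched with within-group proportions, so that samples of very different sizes (51, 70 and 4056 students) are put on the same scale; the data below is made up for illustration, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the Macao-China subset (made-up data): compare the
# math-level distributions across test languages as proportions, since
# the group sizes differ wildly.
macao = pd.DataFrame({
    'Language_of_the_test': ['Cantonese'] * 6 + ['English'] * 4,
    'Plausible_level_1_in_mathematics': ['3', '4', '4', '5', '3', '2',
                                         '4', '3', '5', '4'],
})

props = (macao.groupby('Language_of_the_test')['Plausible_level_1_in_mathematics']
              .value_counts(normalize=True)
              .unstack(fill_value=0))
print(props)  # each row sums to 1, so the shapes are directly comparable
```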

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Contrary to math levels by gender, READING levels by gender suggest that girls do better than boys at this task.

  • reading and ESCS: the relationship between reading and ESCS is similar to that between math and ESCS, maybe a little less clean in the lowest reading levels and more defined in the upper ones.

  • ATTENTION!!! Type of booklets: the distribution of math scores of the students who solved the easier set of booklets looks a lot like the distribution of the scores of the students in the two alphabetic language groups.

    The distribution of math scores of the students who solved the standard set of booklets looks a lot like the distribution of the scores of the students in the logographic language group.

    The simplified set is present only in the two alphabetic language groups.

Multivariate Exploration

I'll first have a look at the distributions divided by booklet type.

In [155]:
# distribution of math scores by language type, divided by booklet type - language at home consistent with language of the test
g = sb.FacetGrid(home_test_lang_df, col='Standard_or_simplified_set_of_booklets', height=4, aspect=1.5, sharex=False, sharey=False)
g.map_dataframe(sb.countplot, 'Plausible_level_1_in_mathematics', hue='Language_type', palette=sb.color_palette()).add_legend()
Out[155]:
<seaborn.axisgrid.FacetGrid at 0x23e3aefec08>

!!!Look at the distribution shapes: heights are not comparable among groups here, since the sample sizes differ a lot!!!

Taking away the simplified-booklet results, the distribution of the math scores of the two alphabetic groups improves a lot. The shallow orthography one looks normal. The deep orthography one is still a bit heavier on the low-result side. The logographic one, obviously, has not changed.
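One way to make such facet comparisons robust to unequal group sizes is to plot within-group proportions instead of raw counts; a minimal sketch with toy data (not the notebook's home_test_lang_df):

```python
import pandas as pd

# Within-group proportions via crosstab(normalize='index'): each row of
# the result sums to 1, so distribution shapes can be compared even when
# the language-type groups have very different sizes (toy data).
toy = pd.DataFrame({
    'Language_type': ['logographic', 'logographic',
                      'deep orthography', 'deep orthography',
                      'deep orthography', 'shallow orthography'],
    'Plausible_level_1_in_mathematics': ['5', '4', '2', '3', '2', '3'],
})

props = pd.crosstab(toy['Language_type'],
                    toy['Plausible_level_1_in_mathematics'],
                    normalize='index')
print(props.round(2))
```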

In [156]:
# distribution of MATH SCORES by LANGUAGE TYPE, divided by BOOKLET TYPE and by COUNTRY
# language at home consistent with language of the test

g = sb.FacetGrid(home_test_lang_df, col='Standard_or_simplified_set_of_booklets', row='Country', height=4, aspect=1.5, sharex=False, sharey=False)
g.map_dataframe(sb.countplot, 'Plausible_level_1_in_mathematics', hue='Language_type', palette=sb.color_palette()).add_legend();

Countries that chose to use the simplified version used ONLY that one. In those countries the results are all right-skewed, with very bad results in general. Among the countries that used the standard set of booklets, a similar distribution is seen only in Qatar.

The "standard booklet" Countries distributions looks generally more symmetrical, with the mode varying between

  • 2 and 3, for the deep orthography languages,
  • 3 and 4, for the shallow orthography languages,
  • 3 and 4 again for the logographic languages, but in general these look heavier on the better-scores side.

Two exceptions are:

  • China-Shanghai, where the distribution appears to increase linearly with the math score, and
  • Singapore, whose teaching language is English and whose distribution is more similar to those of the logographic group (and also to the China-Shanghai one).

A serious limitation is that, by not excluding earlier the rows where the international language at home differs from the language of the test, and those with the simplified set of booklets, some interesting countries (such as Singapore or Switzerland) have been left with very little data.

ARGH! Let's START AGAIN!

In [157]:
# have a look at the initially selected Countries data
pisa_subset.head()
Out[157]:
CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ST03Q01 ... PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE W_FSTUWT SENWGT_STU VER_STU
0 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00001 10 1 08 ... 310.3345 241.3672 397.2771 387.9523 375.8299 365.5725 351.5852 2.8838 0.071 22NOV13
1 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00002 10 1 06 ... 198.2226 243.1315 301.5106 267.0086 315.4979 289.3883 305.2406 2.8838 0.071 22NOV13
2 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00003 10 1 09 ... 251.0707 322.4439 293.771 278.8512 200.5222 234.0918 327.3405 2.8838 0.071 22NOV13
3 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00004 9 1 12 ... 214.1813 252.6746 329.2055 298.4334 281.6486 319.8806 318.9482 2.8838 0.071 22NOV13
4 United Arab Emirates 7840000 ARE0762 Non-OECD United Arab Emirates 0000001 00005 10 1 03 ... 301.6735 249.547 378.1611 405.2033 389.351 408.0007 410.7982 2.8838 0.071 22NOV13

5 rows × 535 columns

In [158]:
pisa_subset.shape
Out[158]:
(314831, 535)

Variables that I'm keeping after the exploration done up to here:

  • 'Country code 3-character',
  • 'Student ID', NB: it restarts from 1 for every country, so it needs to be coupled with the Country variable to be unique,
  • 'Gender',
  • 'Language of the test',
  • 'Standard or simplified set of booklets',
  • 'Attitude towards School: Learning Activities',
  • 'Cognitive Activation in Mathematics Lessons',
  • 'Index of economic, social and cultural status',
  • 'Language at home (3-digit code)',
  • 'International Language at Home',
  • 'Plausible value 1 in mathematics',
  • 'Plausible value 1 in reading',
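The note on 'Student ID' above can be made concrete: a sketch of building a unique key by combining it with the country (toy data; the Unique_ID column is hypothetical, not part of the notebook).

```python
import pandas as pd

# Student_ID restarts from 1 in every country, so it is only unique in
# combination with Country (toy example).
toy = pd.DataFrame({'Country': ['Italy', 'Italy', 'Spain'],
                    'Student_ID': ['00001', '00002', '00001']})

toy['Unique_ID'] = toy['Country'] + '_' + toy['Student_ID']

print(toy['Student_ID'].duplicated().any())   # True: IDs alone collide
print(toy['Unique_ID'].duplicated().any())    # False: the combined key is unique
```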
In [159]:
descriptions = ['Country code 3-character',
                'Student ID',
                'Gender',
                'Language of the test',
                'Standard or simplified set of booklets',
                'Attitude towards School: Learning Activities',
                'Cognitive Activation in Mathematics Lessons',
                'Index of economic, social and cultural status',
                'Language at home (3-digit code)',
                'International Language at Home',
                'Plausible value 1 in mathematics',
                'Plausible value 1 in reading']
In [160]:
#get the column names
columns_to_keep = pisa_variables[pisa_variables['x'].isin(descriptions)]['code'].tolist()
In [161]:
pisa_chosen_var = pisa_subset[columns_to_keep]
In [162]:
pisa_chosen_var
Out[162]:
CNT STIDSTD ST04Q01 ST25Q01 EASY ATTLNACT COGACT ESCS LANGN TESTLANG PV1MATH PV1READ
0 United Arab Emirates 00001 Male Language of the test Easier set of booklets -1.7015 -0.4593 0.65 Arabic Arabic 328.2521 313.5423
1 United Arab Emirates 00002 Male Language of the test Easier set of booklets -0.9394 3.2019 0.72 Arabic Arabic 270.7665 269.5957
2 United Arab Emirates 00003 Male Language of the test Easier set of booklets -0.9394 0.1015 -0.09 Arabic Arabic 314.9322 267.9116
3 United Arab Emirates 00004 Male Language of the test Easier set of booklets NaN NaN 1.08 Arabic Arabic 250.2804 217.389
4 United Arab Emirates 00005 Male Language of the test Easier set of booklets -0.9394 3.2019 -0.24 Arabic Arabic 310.1807 279.219
... ... ... ... ... ... ... ... ... ... ... ... ...
314826 Vietnam 04955 Female Language of the test Easier set of booklets NaN NaN -2.08 Vietnamese Vietnamese 477.1849 460.2272
314827 Vietnam 04956 Male Language of the test Easier set of booklets -0.3397 -0.6041 -1.22 Vietnamese Vietnamese 518.936 490.9325
314828 Vietnam 04957 Male Language of the test Easier set of booklets -1.4236 -0.7532 -1.92 Vietnamese Vietnamese 475.2376 462.6239
314829 Vietnam 04958 Male Language of the test Easier set of booklets 0.0873 0.2435 -0.71 Vietnamese Vietnamese 550.9503 505.2873
314830 Vietnam 04959 Female Language of the test Easier set of booklets NaN NaN -1.16 Vietnamese Vietnamese 470.0187 532.3506

314831 rows × 12 columns

Assess, Clean and Test Again

In [163]:
pisa_chosen_var.shape
Out[163]:
(314831, 12)
In [164]:
pisa_chosen_var.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 12 columns):
CNT         314831 non-null object
STIDSTD     314831 non-null object
ST04Q01     314831 non-null object
ST25Q01     300461 non-null object
EASY        314831 non-null object
ATTLNACT    202093 non-null object
COGACT      204216 non-null object
ESCS        309909 non-null object
LANGN       312263 non-null object
TESTLANG    314218 non-null object
PV1MATH     314831 non-null object
PV1READ     314831 non-null object
dtypes: object(12)
memory usage: 28.8+ MB

ISSUES:

Missing data: these rows will disappear during cleaning

It is not data that I can retrieve elsewhere; therefore, after a bit of cleaning I will assess the columns I want to use and decide what to do.

Tidiness

The dataset looks OK.

Quality
  • column names need to be more informative
  • all variables are strings, which is not correct
  • plausible values: do they need to be converted into plausible levels? Not so sure now (some info on this is reported in the next cell)

I'll fix these quality issues and then look for more.

In [165]:
# make a copy
df1 = pisa_chosen_var.copy() # again, but df is a convenient name
print('Done!')
Done!

I'll deal first with the plausible-value columns (both the conversion of values to levels and of levels to categories).

The column names (from Excel) are:

mathematics: 'PV1MATH'

reading: 'PV1READ'
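For reference, a conversion function like the conversion_math applied below can be sketched with pd.cut and the PISA 2012 mathematics proficiency cut points (357.77, 420.07, 482.38, 544.68, 606.99, 669.30, as given in the technical report; treat the exact values here as an assumption, not the notebook's actual function):

```python
import pandas as pd

# Bin plausible values into proficiency levels with pd.cut; the edges are
# the assumed PISA 2012 math cut points, open-ended at both extremes.
math_edges = [-float('inf'), 357.77, 420.07, 482.38, 544.68,
              606.99, 669.30, float('inf')]
math_labels = ['Below 1', '1', '2', '3', '4', '5', '6']

scores = pd.Series([328.25, 513.64, 700.0])
levels = pd.cut(scores, bins=math_edges, labels=math_labels)
print(levels.tolist())  # ['Below 1', '3', '6']
```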

In [166]:
# convert the columns from strings to float
all_PV2 = ['PV1MATH', 'PV1READ']

for col in all_PV2:
    df1[col] = df1[col].astype('float')
    
print('Done!')
Done!
In [167]:
# verify dtype
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 12 columns):
CNT         314831 non-null object
STIDSTD     314831 non-null object
ST04Q01     314831 non-null object
ST25Q01     300461 non-null object
EASY        314831 non-null object
ATTLNACT    202093 non-null object
COGACT      204216 non-null object
ESCS        309909 non-null object
LANGN       312263 non-null object
TESTLANG    314218 non-null object
PV1MATH     314831 non-null float64
PV1READ     314831 non-null float64
dtypes: float64(2), object(10)
memory usage: 28.8+ MB
In [168]:
# convert the values for mathematics and turn into categories INTO A NEW COLUMN
print('Working on this..')

# apply conversion

df1['Plausible_level_math'] = df1['PV1MATH'].apply(conversion_math)
df1['Plausible_level_math'] = df1['Plausible_level_math'].astype(ordered_math)

print('Done!')
Working on this..
Done!
In [169]:
# convert the values for reading and turn into categories INTO A NEW COLUMN

print('Working on this..')

# apply
df1['Plausible_level_reading'] = df1['PV1READ'].apply(conversion_reading)
df1['Plausible_level_reading'] = df1['Plausible_level_reading'].astype(ordered_reading)

print('Done!')
Working on this..
Done!
In [170]:
df1.head()
Out[170]:
CNT STIDSTD ST04Q01 ST25Q01 EASY ATTLNACT COGACT ESCS LANGN TESTLANG PV1MATH PV1READ Plausible_level_math Plausible_level_reading
0 United Arab Emirates 00001 Male Language of the test Easier set of booklets -1.7015 -0.4593 0.65 Arabic Arabic 328.2521 313.5423 Below 1 1b
1 United Arab Emirates 00002 Male Language of the test Easier set of booklets -0.9394 3.2019 0.72 Arabic Arabic 270.7665 269.5957 Below 1 1b
2 United Arab Emirates 00003 Male Language of the test Easier set of booklets -0.9394 0.1015 -0.09 Arabic Arabic 314.9322 267.9116 Below 1 1b
3 United Arab Emirates 00004 Male Language of the test Easier set of booklets NaN NaN 1.08 Arabic Arabic 250.2804 217.3890 Below 1 Too low
4 United Arab Emirates 00005 Male Language of the test Easier set of booklets -0.9394 3.2019 -0.24 Arabic Arabic 310.1807 279.2190 Below 1 1b
In [171]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314831 entries, 0 to 314830
Data columns (total 14 columns):
CNT                        314831 non-null object
STIDSTD                    314831 non-null object
ST04Q01                    314831 non-null object
ST25Q01                    300461 non-null object
EASY                       314831 non-null object
ATTLNACT                   202093 non-null object
COGACT                     204216 non-null object
ESCS                       309909 non-null object
LANGN                      312263 non-null object
TESTLANG                   314218 non-null object
PV1MATH                    314831 non-null float64
PV1READ                    314831 non-null float64
Plausible_level_math       314831 non-null category
Plausible_level_reading    314831 non-null category
dtypes: category(2), float64(2), object(10)
memory usage: 29.4+ MB

Column names

In [172]:
# all my chosen column descriptions are in the variable descriptions, the names are in the variable columns_to_keep
# unfortunately I cannot zip them, because the order is not the same

# create a subset of pisa_variables
my_columns = pisa_variables[pisa_variables['code'].isin(columns_to_keep)].copy()
In [173]:
# clean the descriptions a little bit; plain .replace() was not working,
# so I chained str.replace calls instead (wrapped for readability)
my_columns['x'] = (my_columns.x
                   .str.replace(' - ', '_')
                   .str.replace(' ', '_')
                   .str.replace("'s", '')
                   .str.replace('-', '_')
                   .str.replace('<', '_')
                   .str.replace('>', '')
                   .str.replace(',', '')
                   .str.replace('__', '_'))
my_columns['x'] = (my_columns.x
                   .str.replace('_value_1_', '_value_')
                   .str.replace('content_subscale_of_', '')
                   .str.replace('process_subscale_of_', '')
                   .str.replace('_code_3_character', '')
                   .str.replace('_.3_digit_code.', ''))
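As an aside on the comment above: a likely reason plain .replace() "was not working" is that Series.replace matches whole cell values by default; substring edits need regex=True (or .str.replace). A hedged sketch with a replacement dict on toy strings, not the notebook's full cleaning:

```python
import pandas as pd

# With regex=True, the dict keys are treated as patterns and applied as
# substring substitutions, which keeps the cleaning step compact.
s = pd.Series(["Plausible value 1 in mathematics",
               "Country code 3-character"])

cleaned = s.replace({' - ': '_', ' ': '_', '-': '_'}, regex=True)
print(cleaned.tolist())
```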
In [174]:
# create a dictionary of the current column names and the cleaned descriptions, set the latter as new names
columns_dict = dict(zip(my_columns['code'], my_columns['x']))

df1.rename(columns=columns_dict, inplace=True)

# check up
df1.head()
Out[174]:
Country Student_ID Gender International_Language_at_Home Standard_or_simplified_set_of_booklets Attitude_towards_School:_Learning_Activities Cognitive_Activation_in_Mathematics_Lessons Index_of_economic_social_and_cultural_status Language_at_home Language_of_the_test Plausible_value_in_mathematics Plausible_value_in_reading Plausible_level_math Plausible_level_reading
0 United Arab Emirates 00001 Male Language of the test Easier set of booklets -1.7015 -0.4593 0.65 Arabic Arabic 328.2521 313.5423 Below 1 1b
1 United Arab Emirates 00002 Male Language of the test Easier set of booklets -0.9394 3.2019 0.72 Arabic Arabic 270.7665 269.5957 Below 1 1b
2 United Arab Emirates 00003 Male Language of the test Easier set of booklets -0.9394 0.1015 -0.09 Arabic Arabic 314.9322 267.9116 Below 1 1b
3 United Arab Emirates 00004 Male Language of the test Easier set of booklets NaN NaN 1.08 Arabic Arabic 250.2804 217.3890 Below 1 Too low
4 United Arab Emirates 00005 Male Language of the test Easier set of booklets -0.9394 3.2019 -0.24 Arabic Arabic 310.1807 279.2190 Below 1 1b

Datatypes

In [175]:
df1.International_Language_at_Home.unique()
Out[175]:
array(['Language of the test', nan, 'Other language'], dtype=object)
In [176]:
df1.Standard_or_simplified_set_of_booklets.unique()
Out[176]:
array(['Easier set of booklets', 'Standard set of booklets'], dtype=object)
In [177]:
# drop all the rows where booklets are the simplified ones
df1 = df1.query('Standard_or_simplified_set_of_booklets=="Standard set of booklets"')

# drop the column
df1 = df1.drop('Standard_or_simplified_set_of_booklets', axis=1)
In [178]:
df1.shape
Out[178]:
(222370, 13)
In [179]:
# Attitude_towards_School:_Learning_Activities converted into float
df1['Attitude_towards_School:_Learning_Activities'] = df1['Attitude_towards_School:_Learning_Activities'].astype(float)
In [180]:
# Cognitive activation and ESCS index to float
df1['Cognitive_Activation_in_Mathematics_Lessons'] = df1['Cognitive_Activation_in_Mathematics_Lessons'].astype(float)
df1['Index_of_economic_social_and_cultural_status'] = df1['Index_of_economic_social_and_cultural_status'].astype(float)
In [181]:
df1['Language_at_home'].unique()
Out[181]:
array(['English', 'Another language (AUS) ', 'Missing', nan, 'Italian',
       'German ', 'Cantonese', 'Hindi', 'Invalid', 'Mandarin ', 'Greek',
       'Vietnamese ', 'Spanish', 'Australian languages ', 'Arabic ',
       'Turkish', 'Serbian', 'Polish ', 'Another language (AUT) ',
       'Bosnian', 'Romanian ', 'Croatian ', 'Macedonian ', 'Albanian ',
       'Russian', 'Kurdish', 'French ', 'Another language (BEL) ',
       'Eastern European languages ', 'Western European languages ',
       'Flemish dialect (BEL)', 'Dutch', 'Walloon',
       'German dialect (BEL) ', 'Another language (CAN) ',
       'Swiss German ', 'Another language (CHE) ',
       'Yugoslavian - Serbian, Croatian, etc ', 'Swiss Italian',
       'Portuguese ', "Raeto'-Romance ", 'German (LIE) ',
       'Another language (DEU) ', 'Basque ', 'Another language (ESP) ',
       'Catalan', 'Galician ', 'Valencian', 'Finnish', 'Somali ',
       'Another language (FIN) ', 'Estonian ', 'Chinese', 'Thai ',
       'Swedish', 'Romani ', 'Another language (FRA) ',
       'Regional languages (FRA) ', 'Another language (QUK) ', 'Welsh',
       'Another language (QSC) ', 'Ulster Scots ', 'Irish',
       'Other European languages (QSC) ', 'Scottish Gaelic',
       'Another language (HKG) ', 'Chinese dialects or languages (HKG)',
       'Another language (IRL) ', 'Another EU language (ITA)',
       'A dialect (ITA)', 'Another language (ITA) ',
       'Another official language (ITA)', 'Slovenian', 'Japanese ',
       'Another language (JPN) ', 'Korean ', 'Luxembourgish',
       'Another language (LUX) ', 'Another language (MAC) ',
       'Chinese dialects or languages (MAC)', 'Another language (NZL) ',
       'Maori', 'Samoan ', 'Another language (QAT) ', 'Shanghai dialect ',
       'Another local Chinese dialect (QCN)', 'Another language (QCN) ',
       'Another language (USA) ', 'Tamil', 'Malay',
       'Another language (SGP) ', 'Taiwanese dialect',
       'Hakka dialect (TAP)', 'Another language (TAP) ',
       'Aboriginal dialect (TAP) '], dtype=object)
In [182]:
# clean the trailing whitespaces
df1.Language_at_home = df1.Language_at_home.str.strip()
In [183]:
# Language_of_test: check the values 
df1.Language_of_the_test.unique()
Out[183]:
array(['English', 'German ', 'French ', 'Dutch', nan, 'Italian',
       'Spanish', 'Catalan', 'Basque ', 'Valencian', 'Finnish', 'Swedish',
       'Welsh', 'Cantonese', 'Irish', 'Slovenian', 'Japanese ', 'Korean ',
       'Chinese', 'Portuguese ', 'Arabic ',
       'Hybrid - English + Arabic (QAT)', 'Shanghai dialect ',
       'Mandarin '], dtype=object)

Language of the test is OK as a string, but I can drop the rows where the test is not in one of my chosen languages. The groups are:

  • shallow_ortography: Spanish, Finnish, Italian, German

  • deep_ortography: English, French, Arabic

  • logographic: Chinese, Japanese, Korean. To these I will add Shanghai dialect, Mandarin and Cantonese, because they are all written with Chinese characters.

Since English and Arabic belong to the same group, I will keep the "Hybrid - English + Arabic (QAT)" group, renaming it as 'English_Arabic'.

In [184]:
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']

test_lang_to_keep = shallow_ortography + deep_ortography + logographic
In [185]:
# clean the test language labels
df1.Language_of_the_test = df1.Language_of_the_test.str.strip().str.replace('Hybrid.*', 'English_Arabic')
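One portability note about the cell above: 'Hybrid.*' is interpreted as a regular expression, and in recent pandas versions str.replace defaults to regex=False, so the flag should be passed explicitly for this to keep working:

```python
import pandas as pd

# Explicit regex=True keeps the 'Hybrid.*' pattern working across pandas
# versions where the default changed to literal matching.
s = pd.Series(['Hybrid - English + Arabic (QAT)', 'English'])
print(s.str.replace(r'Hybrid.*', 'English_Arabic', regex=True).tolist())
```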
In [186]:
# check values
df1.Language_of_the_test.unique()
Out[186]:
array(['English', 'German', 'French', 'Dutch', nan, 'Italian', 'Spanish',
       'Catalan', 'Basque', 'Valencian', 'Finnish', 'Swedish', 'Welsh',
       'Cantonese', 'Irish', 'Slovenian', 'Japanese', 'Korean', 'Chinese',
       'Portuguese', 'Arabic', 'English_Arabic', 'Shanghai dialect',
       'Mandarin'], dtype=object)
In [187]:
# drop all rows where the test is not in test_lang_to_keep
df1 = df1.query('Language_of_the_test in @test_lang_to_keep')
In [188]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 211108 entries, 17408 to 309871
Data columns (total 13 columns):
Country                                         211108 non-null object
Student_ID                                      211108 non-null object
Gender                                          211108 non-null object
International_Language_at_Home                  199969 non-null object
Attitude_towards_School:_Learning_Activities    136783 non-null float64
Cognitive_Activation_in_Mathematics_Lessons     137546 non-null float64
Index_of_economic_social_and_cultural_status    207533 non-null float64
Language_at_home                                209461 non-null object
Language_of_the_test                            211108 non-null object
Plausible_value_in_mathematics                  211108 non-null float64
Plausible_value_in_reading                      211108 non-null float64
Plausible_level_math                            211108 non-null category
Plausible_level_reading                         211108 non-null category
dtypes: category(2), float64(5), object(6)
memory usage: 19.7+ MB

I still need to drop the nulls from

  • International_Language_at_Home
  • Index_of_economic_social_and_cultural_status
In [189]:
# drop rows where the language at home is not the language of the test
df1 = df1.query('International_Language_at_Home=="Language of the test"')

# and drop the column
df1.drop('International_Language_at_Home', axis=1, inplace=True)
In [190]:
# drop rows without ESCS index
df1 = df1.dropna(subset=['Index_of_economic_social_and_cultural_status'])
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 166912 entries, 17408 to 309871
Data columns (total 12 columns):
Country                                         166912 non-null object
Student_ID                                      166912 non-null object
Gender                                          166912 non-null object
Attitude_towards_School:_Learning_Activities    109586 non-null float64
Cognitive_Activation_in_Mathematics_Lessons     110145 non-null float64
Index_of_economic_social_and_cultural_status    166912 non-null float64
Language_at_home                                166912 non-null object
Language_of_the_test                            166912 non-null object
Plausible_value_in_mathematics                  166912 non-null float64
Plausible_value_in_reading                      166912 non-null float64
Plausible_level_math                            166912 non-null category
Plausible_level_reading                         166912 non-null category
dtypes: category(2), float64(5), object(5)
memory usage: 14.3+ MB
In [191]:
df1[['Plausible_value_in_mathematics', 'Plausible_value_in_reading']].describe()
Out[191]:
Plausible_value_in_mathematics Plausible_value_in_reading
count 166912.000000 166912.000000
mean 512.334852 512.147424
std 99.624082 96.222315
min 93.636100 22.356000
25% 445.793700 452.363600
50% 513.639200 518.840200
75% 581.251000 579.453200
max 962.229300 904.802600
In [192]:
# math and reading plausible values

plt.figure(figsize=(10,6))

plt.subplot(1,2,1)
plt.hist(df1.Plausible_value_in_mathematics, color=sb.color_palette()[2], bins=120);

plt.subplot(1,2,2)
plt.hist(df1.Plausible_value_in_reading, color=sb.color_palette()[3], bins=120);
In [193]:
# math and reading plausible levels
plt.figure(figsize=(10,6))

plt.subplot(1,2,1)
sb.countplot(df1.Plausible_level_math, color=sb.color_palette()[2]);

plt.subplot(1,2,2)
sb.countplot(df1.Plausible_level_reading, color=sb.color_palette()[3]);

MATHEMATICS

With the dataset better cleaned from the start, the math values and levels appear to be normally distributed, with the mode at 3 for the levels and a mean around 500 for the values.

READING

The distribution of the reading levels also has a mode of 3, and the mean of the reading values is around 500, like that of the math values, but both reading distributions are slightly left-skewed.
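The visual impression of skew can also be checked numerically; pandas' .skew() returns a negative value for left-skewed data (toy numbers standing in for the plausible-value columns):

```python
import pandas as pd

# A small left-skewed sample: the long tail toward low values pulls the
# sample skewness below zero.
sample = pd.Series([200, 420, 470, 500, 510, 520, 530, 540, 550, 560])
print(sample.skew())  # negative => left-skewed
```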

Other variables

CATEGORICAL variables

In [194]:
# CATEGORIES:
categories_to_plot = ['Country', 'Language_of_the_test', 'Language_at_home', 'Gender']

# create the grid
fig, axes = plt.subplots(2,2, figsize=(20,20))
axes = axes.flatten()

for i in range(4):
    plt.sca(axes[i])
    col = categories_to_plot[i] # column to plot
    sb.countplot(data=df1, y=col, color=sb.color_palette()[i])
    plt.xticks(rotation=20)

Country

  • Country: the most represented is Italy (more than 20k students), followed by Spain and Canada (above 15k each). Between 10k and 15k there are Australia and the UK. All other countries are under 10k.

Language

  • Language_of_the_test: English is the most represented, followed by Italian. I need to regroup them into my 3 language-type categories again; the logographic group is still the least numerous.

  • Language_at_home: a reasonable number of languages survived the last cleaning.

Gender

  • The students in the selected dataset are about 50% male and 50% female.

Language_of_the_test

Let's regroup the languages into my main three categories, which are already summarized in these variables:

shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
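The three .loc assignments in the next cell can equivalently be written as a single np.select; a sketch with toy data ('other' is a placeholder default, not a category used in the notebook):

```python
import numpy as np
import pandas as pd

# Map each test language to its orthography group in one np.select call.
shallow_ortography = ['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect',
               'Mandarin', 'Cantonese']

toy = pd.DataFrame({'Language_of_the_test':
                    ['Italian', 'English', 'Korean', 'Dutch']})
toy['Language_type'] = np.select(
    [toy.Language_of_the_test.isin(shallow_ortography),
     toy.Language_of_the_test.isin(deep_ortography),
     toy.Language_of_the_test.isin(logographic)],
    ['shallow orthography', 'deep orthography', 'logographic'],
    default='other')
print(toy['Language_type'].tolist())
```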
In [195]:
# create a column "Language_type" and populate it on the base of the language_of_the_test variable

df1.loc[df1.Language_of_the_test.isin(shallow_ortography), 'Language_type'] = 'shallow orthography'
df1.loc[df1.Language_of_the_test.isin(deep_ortography), 'Language_type'] = 'deep orthography'
df1.loc[df1.Language_of_the_test.isin(logographic), 'Language_type'] = 'logographic'
In [196]:
# check my new column
df1.Language_type.value_counts()
Out[196]:
deep orthography       76009
shallow orthography    61703
logographic            29200
Name: Language_type, dtype: int64
In [197]:
# plot it
labels = []
for a, b in zip(df1.Language_type.value_counts().index, df1.Language_type.value_counts().values/df1.Language_type.count()*100):
    labels.append(a + '\n' + '{:.2f}'.format(b) + '%')
plt.pie(df1.Language_type.value_counts(), labels=labels, startangle=-74, counterclock=True);
plt.axis('square');

Logographic languages still account for a much smaller proportion of the overall data than the other two language groups.

NUMERIC variables

In [198]:
# Index_of_economic_social_cultural_status (ESCS): 

fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [20, 40, 80]

for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Index_of_economic_social_and_cultural_status', bins=bins[i]);
    plt.title('ESCS index distribution: {} bins'.format(bins[i]))
In [199]:
ESCS0_below = (df1.Index_of_economic_social_and_cultural_status < 0).sum()
ESCS0_above = (df1.Index_of_economic_social_and_cultural_status >= 0).sum()
print('students below ESCS 0: {};\nstudents at or above ESCS 0: {}\ndifference: {}'.format(ESCS0_below, ESCS0_above, (ESCS0_below-ESCS0_above)))

ESCS_minus2_below = (df1.Index_of_economic_social_and_cultural_status < -2).sum()
ESCS_plus2_above = (df1.Index_of_economic_social_and_cultural_status > 2).sum()
print('\nstudents below ESCS -2: {};\nstudents at or above ESCS +2: {}\ndifference: {}'.format(ESCS_minus2_below, ESCS_plus2_above, (ESCS_minus2_below-ESCS_plus2_above)))

print('\nnumber of NaNs: ', df1.Index_of_economic_social_and_cultural_status.isna().sum())
students below ESCS 0: 74854;
students at or above ESCS 0: 92058
difference: -17204

students below ESCS -2: 2013;
students at or above ESCS +2: 1051
difference: 962

number of NaNs:  0

The ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (not privileged, but not disadvantaged either), then there are 17204 more students on the disadvantaged side than on the advantaged one. Before the last cleaning there were 20135 more. As for the extremes, there are now 962 more students with ESCS < -2 than with ESCS > +2. Before, there were 17777 more.

The distribution is slightly less skewed now, for two possible reasons:

  • the removal of all countries that chose the simplified set of booklets (probably the main reason);
  • the removal of all the students whose language at home was different from the language of the test, i.e. mainly immigrants.
In [200]:
# Attitude_towards_School:_Learning_Activities
fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [10, 20, 40]

for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Attitude_towards_School:_Learning_Activities', bins=bins[i]);
    plt.title('Attitude_towards_School:_Learning_Activities: {} bins'.format(bins[i]))
In [201]:
# Cognitive_Activation_in_Mathematics_Lessons
fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [10, 20, 40]

for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Cognitive_Activation_in_Mathematics_Lessons', bins=bins[i]);
    plt.title('Cognitive_Activation_in_Mathematics_Lessons: {} bins'.format(bins[i]))
In [202]:
# unique values, NaNs and their row "Language of test" values
print(df1.Cognitive_Activation_in_Mathematics_Lessons.isna().sum())
print(df1.loc[df1.Cognitive_Activation_in_Mathematics_Lessons.isna()]['Language_of_the_test'].value_counts())
56767
English             19321
Italian              7770
Spanish              6468
French               4756
German               4589
Finnish              2045
Arabic               2031
Japanese             2010
Shanghai dialect     1673
Korean               1656
Mandarin             1610
Chinese              1432
Cantonese            1406
Name: Language_of_the_test, dtype: int64
In [203]:
# the language_of_the_test values:
df1.Language_of_the_test.value_counts()
Out[203]:
English             56879
Italian             23103
Spanish             19242
French              13797
German              13404
Japanese             6059
Finnish              5954
Arabic               5333
Shanghai dialect     5054
Korean               4954
Mandarin             4695
Chinese              4281
Cantonese            4157
Name: Language_of_the_test, dtype: int64

The last two variables have a lot of NaNs, again mostly in the two bigger language-type groups, but a significant number also in the logographic group. Since I'm not using these two variables, they will remain as they are.
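Rather than comparing two value_counts by eye, the per-group share of missing values can be computed directly. A minimal sketch on a tiny synthetic frame (the column names mirror the ones above; in the notebook this would run on `df1` itself):

```python
import pandas as pd
import numpy as np

# tiny synthetic stand-in for df1: one row per student
demo = pd.DataFrame({
    'Language_of_the_test': ['English', 'English', 'Italian', 'Japanese'],
    'Cognitive_Activation_in_Mathematics_Lessons': [0.3, np.nan, np.nan, 0.1],
})

# fraction of NaNs within each language group
nan_share = (demo['Cognitive_Activation_in_Mathematics_Lessons']
             .isna()
             .groupby(demo['Language_of_the_test'])
             .mean())
print(nan_share)
```

Grouping the boolean `isna()` mask and taking the mean gives the missing rate per group in one step.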

In [204]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 166912 entries, 17408 to 309871
Data columns (total 13 columns):
Country                                         166912 non-null object
Student_ID                                      166912 non-null object
Gender                                          166912 non-null object
Attitude_towards_School:_Learning_Activities    109586 non-null float64
Cognitive_Activation_in_Mathematics_Lessons     110145 non-null float64
Index_of_economic_social_and_cultural_status    166912 non-null float64
Language_at_home                                166912 non-null object
Language_of_the_test                            166912 non-null object
Plausible_value_in_mathematics                  166912 non-null float64
Plausible_value_in_reading                      166912 non-null float64
Plausible_level_math                            166912 non-null category
Plausible_level_reading                         166912 non-null category
Language_type                                   166912 non-null object
dtypes: category(2), float64(5), object(6)
memory usage: 20.6+ MB
In [205]:
# make a copy (reordered, without Cognitive activation and Attitude towards school)
exploration_df1 = df1[['Country', 'Student_ID', 'Gender', 'Index_of_economic_social_and_cultural_status', 'Language_at_home', 'Language_of_the_test', 'Language_type', 'Plausible_value_in_mathematics', 'Plausible_value_in_reading', 'Plausible_level_math', 'Plausible_level_reading']].copy()

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

  • Country : the most represented is Italy (more than 20k students), followed by Spain and Canada (above 15k each). Between 10k and 15k there are Australia and the UK. All other Countries are under 10k;
  • Student_ID
  • Gender : about 50% females, 50% males
  • Index_of_economic_social_and_cultural_status : the ESCS distribution is left-skewed (less than before, probably because of the absence of the Countries that chose the simplified booklets). If we take 0 as a neutral point, where students are "OK" (neither privileged nor disadvantaged), then there are 17204 more students on the disadvantaged side than on the advantaged one. Before the last cleaning the difference was 20135. As for the extremes, there are now 962 more students with ESCS < -2 than with ESCS > +2. Before the cleaning there were 17777 of them.
  • Language_at_home : a reasonable number of languages survived the last cleaning (read: far fewer than before);
  • Language_of_the_test : English is the most represented, followed by Italian. Then we have Spanish, French and German. All others account for less than 8k rows each;
  • Language_type : my new variable: three categories, "shallow orthography" (Spanish, Finnish, Italian, German), "deep orthography" (English, French, Arabic, English_Arabic), "logographic" (Chinese, Japanese, Korean, Shanghai dialect, Mandarin, Cantonese)
  • Plausible_value_in_mathematics and Plausible_level_math : with the dataset better cleaned from the start, the math values and levels appear to be normally distributed, with the mode at level 3 and a mean around 500 for the values
  • Plausible_value_in_reading and Plausible_level_reading : the distribution of the reading levels also has a mode of 3, and the mean of the reading values is around 500, like the one for math, but these two distributions are slightly left-skewed.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Data were difficult to interpret because of the presence of the Countries that chose to use only the simplified set of booklets. I had initially kept them, because they had been weighted so that the results of those Countries could be compared with the rest of the data. It is indeed possible to use all the data together (with the correct weight for each row) to judge the performance of the schooling system of each Country compared to the others. However, I am regrouping the data by a different criterion (the language type), and therefore including the Countries whose schooling system evidently has a markedly different effect on the performance of their students (and, in fact, those Countries chose the simplified booklets) was simply adding complexity, and a variable that would need to be highlighted.

Along the same line, I directly dropped the rows where the language at home was not the language of the test.


Along with the Plausible_level variables, this time I kept the original Plausible_values given in the PISA 2012 dataset. It is true that the levels are easier to interpret, but the values are a numeric variable that can be useful for the visualizations.


This time I did not trim the data to try and have samples of the same size, because I found that it can be useful to explore within Countries as well, to get an idea of the effect of the schooling system.

In [206]:
# Language_type by Language_of_the_test and Country
exploration_df1.groupby(['Language_type','Language_of_the_test'])['Country'].value_counts()
Out[206]:
Language_type        Language_of_the_test  Country                 
deep orthography     Arabic                Qatar                        5333
                     English               Canada                      13382
                                           Australia                   12385
                                           United Kingdom              11197
                                           Ireland                      4571
                                           United States of America     4179
                                           New Zealand                  3474
                                           Singapore                    2312
                                           Florida (USA)                1577
                                           Connecticut (USA)            1467
                                           Massachusetts (USA)          1466
                                           Qatar                         722
                                           Luxembourg                     83
                                           Macao-China                    51
                                           Hong Kong-China                13
                     French                France                       3990
                                           Canada                       3509
                                           Switzerland                  3444
                                           Belgium                      2399
                                           Luxembourg                    455
logographic          Cantonese             Hong Kong-China              4157
                     Chinese               Macao-China                  4281
                     Japanese              Japan                        6059
                     Korean                Korea                        4954
                     Mandarin              Chinese Taipei               4695
                     Shanghai dialect      China-Shanghai               5054
shallow orthography  Finnish               Finland                      5954
                     German                Switzerland                  4983
                                           Austria                      3871
                                           Germany                      3670
                                           Belgium                       494
                                           Liechtenstein                 238
                                           Luxembourg                    118
                                           Italy                          30
                     Italian               Italy                       22799
                                           Switzerland                   304
                     Spanish               Spain                       19242
Name: Country, dtype: int64
In [207]:
exploration_df1.Language_type.value_counts()
Out[207]:
deep orthography       76009
shallow orthography    61703
logographic            29200
Name: Language_type, dtype: int64
In [208]:
exploration_df1.groupby(['Language_type'])['Language_of_the_test'].value_counts()
Out[208]:
Language_type        Language_of_the_test
deep orthography     English                 56879
                     French                  13797
                     Arabic                   5333
logographic          Japanese                 6059
                     Shanghai dialect         5054
                     Korean                   4954
                     Mandarin                 4695
                     Chinese                  4281
                     Cantonese                4157
shallow orthography  Italian                 23103
                     Spanish                 19242
                     German                  13404
                     Finnish                  5954
Name: Language_of_the_test, dtype: int64
In [209]:
# a quick look, side by side, at the univariate distributions of the variables I want to focus on

plt.figure(figsize=(20,4))

plt.subplot(1,3,1)
plt.hist(data=exploration_df1, x='Plausible_value_in_mathematics', bins=120, color=sb.color_palette()[2]);

plt.subplot(1,3,2)
plt.hist(data=exploration_df1, x='Plausible_value_in_reading', bins=120, color=sb.color_palette()[0]);

plt.subplot(1,3,3)
plt.hist(data=exploration_df1, x='Index_of_economic_social_and_cultural_status', bins=120, color=sb.color_palette()[4]);
plt.title('ESCS index distribution: {} bins'.format(120))
Out[209]:
Text(0.5, 1.0, 'ESCS index distribution: 120 bins')
In [210]:
exploration_df1[['Plausible_value_in_mathematics', 'Plausible_value_in_reading', 'Index_of_economic_social_and_cultural_status']].describe()
Out[210]:
Plausible_value_in_mathematics Plausible_value_in_reading Index_of_economic_social_and_cultural_status
count 166912.000000 166912.000000 166912.000000
mean 512.334852 512.147424 0.092819
std 99.624082 96.222315 0.918376
min 93.636100 22.356000 -5.320000
25% 445.793700 452.363600 -0.560000
50% 513.639200 518.840200 0.130000
75% 581.251000 579.453200 0.800000
max 962.229300 904.802600 3.120000

The distributions of the plausible values in reading and of the ESCS index differ in shape and scale (the latter has a much smaller std, on its own scale), but they look similar in their slight left-skewness.
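The "left-skewness" claim can be quantified instead of judged by eye: pandas has a sample skewness method, where negative values indicate a left skew. A minimal sketch on a synthetic left-skewed sample (in the notebook, the same `.skew()` call would apply to the two columns above):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# synthetic left-skewed sample: the mirror image of an exponential distribution
left_skewed = pd.Series(-rng.exponential(scale=1.0, size=10_000))

# pandas' sample skewness; negative = left (longer tail on the low side)
print(left_skewed.skew())
```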

Bivariate exploration - another round

After better cleaning the data, we can move on! :)

In [211]:
exploration_df1.columns
Out[211]:
Index(['Country', 'Student_ID', 'Gender',
       'Index_of_economic_social_and_cultural_status', 'Language_at_home',
       'Language_of_the_test', 'Language_type',
       'Plausible_value_in_mathematics', 'Plausible_value_in_reading',
       'Plausible_level_math', 'Plausible_level_reading'],
      dtype='object')
In [212]:
# let's look first at a possible correlation between the math levels and the reading levels

sb.countplot(data=exploration_df1, x='Plausible_level_math', hue='Plausible_level_reading', palette='mako_r');
In [213]:
sb.regplot(data=exploration_df1, x='Plausible_value_in_mathematics',
           y='Plausible_value_in_reading', scatter_kws={'alpha':.01});

There is a clear positive correlation between test results in reading and test results in mathematics.

The scatterplot could be improved by sampling the data.
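One way to do that (a sketch, not run in this notebook; `sb` is seaborn as imported earlier, and the frame here is a synthetic stand-in whose column names mirror the real ones): draw a fixed-size random sample and scatter only that, which thins the overplotting without visibly changing the shape of the cloud:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(3)
full = pd.DataFrame({
    'Plausible_value_in_mathematics': rng.normal(512, 100, size=50_000),
    'Plausible_value_in_reading': rng.normal(512, 96, size=50_000),
})

# plot e.g. 5,000 points instead of all of them; random_state makes it reproducible
thinned = full.sample(n=5_000, random_state=42)
print(len(thinned))
# sb.regplot(data=thinned, x='Plausible_value_in_mathematics',
#            y='Plausible_value_in_reading', scatter_kws={'alpha': .05})
```

The regression line could still be fitted on the full data in a separate call if precision matters.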

In [214]:
# cut the ESCS index into bands and create a new category column
min_ESCS = exploration_df1.Index_of_economic_social_and_cultural_status.min()
max_ESCS =  exploration_df1.Index_of_economic_social_and_cultural_status.max()
band_limits = np.linspace(min_ESCS, max_ESCS, num=10, endpoint=True)

# create the category
ESCS_bands = ['-5.32 to -4.38', '-4.38 to -3.44', '-3.44 to -2.51',
              '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31',
              '0.31 to 1.24', '1.24 to 2.18', '2.18 to 3.12']
ESCS_bands_order = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_bands)

def numerical_to_category(x, band_limits=[0,1,2,3,4,5]):
    # the maximum value belongs to the last band
    if x == band_limits[-1]:
        return '{:.2f} to {:.2f}'.format(band_limits[-2], band_limits[-1])
    for i in range(len(band_limits)-1):
        if band_limits[i] <= x < band_limits[i+1]:
            return '{:.2f} to {:.2f}'.format(band_limits[i], band_limits[i+1])

# apply to the ESCS index column and create a new column
exploration_df1['ESCS_levels'] = exploration_df1.Index_of_economic_social_and_cultural_status.apply(numerical_to_category, band_limits=band_limits)
# turn into ordered category
exploration_df1.ESCS_levels = exploration_df1.ESCS_levels.astype(ESCS_bands_order)
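The manual banding function above could also be replaced by `pd.cut`, which returns an ordered categorical directly. A sketch with the same nine equal-width bands (the values here are synthetic ESCS-like numbers standing in for the real column):

```python
import numpy as np
import pandas as pd

values = pd.Series([-5.32, -1.0, 0.0, 1.5, 3.12])  # synthetic ESCS-like values
edges = np.linspace(-5.32, 3.12, num=10, endpoint=True)

# include_lowest closes the first interval on the left, so the minimum
# value is not dropped; pd.cut's categories are already ordered
bands = pd.cut(values, bins=edges, include_lowest=True)
print(bands.cat.categories.size)  # 9 bands
```

This avoids maintaining the hand-written label list and the edge-case handling for the maximum value.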
In [215]:
# ESCS index by math level 
g = sb.FacetGrid(data=exploration_df1, col='Plausible_level_math', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
In [216]:
# math level by ESCS again
sb.violinplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics', color=sb.color_palette()[4], inner='quartile');
sb.boxplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics', color='white');

If we plot the math levels by the ESCS index, we see the curve starting slightly right-skewed for math level 'below 1', turning normal and then bending the other way, with an ESCS mean that increases as we move up the math levels. By level 6 it is definitely left-skewed.

There seems to be a correlation here, and it is a bit sad: very bad levels in mathematics can be "achieved" by students from every socioeconomic and cultural background (though, in fact, students at the lowest ESCS level are not the worst achievers). On the other hand, students living in the most disadvantaged conditions are not present among the top mathematics scores.

In [217]:
# ESCS index by reading levels
g = sb.FacetGrid(data=exploration_df1, col='Plausible_level_reading', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
In [218]:
# reading level by ESCS
sb.violinplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_reading', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_reading', color='white');

The relationship between reading and ESCS is similar to math and ESCS.

In [219]:
# Math value by gender (normalized) 
g = sb.FacetGrid(data=exploration_df1, hue='Gender', hue_order=['Male', 'Female'], height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics',  norm_hist=True);
g.add_legend();
In [220]:
exploration_df1.groupby('Gender').Plausible_value_in_mathematics.describe()
Out[220]:
count mean std min 25% 50% 75% max
Gender
Female 83782.0 505.895564 95.682469 113.8885 441.2759 506.3951 571.0469 912.2994
Male 83130.0 518.824645 103.038972 93.6361 450.6232 521.3507 590.9098 962.2293

Girls seem to score a bit worse, generally: their mean is slightly lower, and their std is narrower.
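The size of this gap can be summarized with a standardized effect size rather than by comparing raw means. A minimal sketch of Cohen's d (a helper I am defining here, not part of the notebook) on synthetic scores whose means and stds echo the table above; in the notebook the two arrays would be the male and female slices of `Plausible_value_in_mathematics`:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
boys = rng.normal(519, 103, size=5_000)   # synthetic, means/stds echo the table
girls = rng.normal(506, 96, size=5_000)
print(cohens_d(boys, girls))
```

A d around 0.1 would conventionally count as a small effect, which matches the "a bit worse" reading of the plot.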

In [221]:
# reading values by gender
g = sb.FacetGrid(data=exploration_df1, hue='Gender', hue_order=['Male', 'Female'], height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading',  norm_hist=True);
g.add_legend();
In [222]:
# Reading level by gender (normalized) 
read_gender_norm = exploration_df1.groupby('Gender')['Plausible_level_reading'].value_counts(normalize=True)
read_gender_norm = read_gender_norm.mul(100)
read_gender_norm = read_gender_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
read_gender_norm.Plausible_level_reading = read_gender_norm.Plausible_level_reading.astype(ordered_reading)

g = sb.catplot(data=read_gender_norm, x='Plausible_level_reading',y='percent',hue='Gender',
               hue_order=['Male', 'Female'], kind='point')

Contrary to math levels by gender, reading levels by gender suggest that girls do better than boys at this task.

In [223]:
# the one I am really interested in: Math level by language type
sb.countplot(data=exploration_df1, x='Plausible_level_math', hue='Language_type', palette='icefire');
plt.legend(loc=1)
#I know it is a diverging palette, but with 3 categories it just is clearer and more pleasant than all the qualitative ones
Out[223]:
<matplotlib.legend.Legend at 0x23e72b27508>

There appears to be a difference in performance among the 3 linguistic groups, but before trying to describe it, let's normalize the counts.

In [224]:
# distribution of math scores by language type (normalized)
math_language_norm = exploration_df1.groupby('Language_type')['Plausible_level_math'].value_counts(normalize=True)
math_language_norm = math_language_norm.mul(100)
math_language_norm = math_language_norm.rename('percent').reset_index()

# turn Plausible_level into an ordered category again
math_language_norm.Plausible_level_math = math_language_norm.Plausible_level_math.astype(ordered_math)

g = sb.catplot(data=math_language_norm, x='Plausible_level_math',y='percent',hue='Language_type',
               hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')
In [225]:
# math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics',  kde=False, bins=120);
g.add_legend();
In [226]:
# math values by language type (normalized)
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics',  norm_hist=True, bins=120);
g.add_legend();
In [227]:
exploration_df1.groupby('Language_type').Plausible_value_in_mathematics.describe()
Out[227]:
count mean std min 25% 50% 75% max
Language_type
deep orthography 76009.0 495.055545 100.269931 93.6361 428.3455 497.3594 564.97120 872.6515
logographic 29200.0 564.770918 101.117630 113.8885 496.5026 568.4764 636.71130 962.2293
shallow orthography 61703.0 508.805846 89.122975 108.5138 448.2084 509.7445 570.61845 856.9170

After normalization (the percentages within groups are now shown) the plot changed very little.

The students using a logographic language perform better. Looking at the performance levels, their distribution seems left-skewed. Looking at the values, which are more fine-grained, the distributions of each language group appear normal: the logographic group has a higher mean and a larger std (the top of the curve is somewhat flattened). The deep orthography group distribution has a similar shape, but with a lower mean (the lowest of the three). The shallow orthography group is in the middle, with a mean a little above that of the deep orthography languages, and the smallest std.

In [228]:
# reading values vs language type (normalized)
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading',  norm_hist=True, bins=120);
g.add_legend();

These curves all look slightly left-skewed. The logographic group scores better in reading as well; the deep orthography group tends to have more scores under 400 (level 2), but past that level its distribution matches the shallow one.

In [229]:
# ESCS by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.distplot, 'Index_of_economic_social_and_cultural_status',  norm_hist=True, bins=120);
g.add_legend();

ESCS and math (or reading) performance are positively correlated, BUT a higher ESCS is not the reason for the better performance of the logographic group, since it generally has a worse ESCS.

There is one area, Macao-China, for which the PISA 2012 dataset records a lot of tests administered in Chinese, and a smaller number in English.

In [230]:
print(exploration_df1[exploration_df1.Country=='Macao-China'].Language_of_the_test.value_counts())
Chinese    4281
English      51
Name: Language_of_the_test, dtype: int64
In [231]:
within_macao = exploration_df1[exploration_df1.Country=='Macao-China'].copy()
In [232]:
within_macao.Language_at_home.value_counts()
Out[232]:
Cantonese                              4083
Chinese dialects or languages (MAC)     127
Mandarin                                 71
English                                  51
Name: Language_at_home, dtype: int64
In [233]:
# reading values score by language at home
g = sb.FacetGrid(data=within_macao, hue='Language_at_home', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading',  norm_hist=True, bins=20);
g.add_legend();
In [234]:
# math values by language at home
g = sb.FacetGrid(data=within_macao, hue='Language_at_home', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics',  norm_hist=True, bins=20);
g.add_legend();

Data for Macao-China have not changed. The number of tests in English is not large (51); however, Mandarin-speaking students here are not many more (71), and the distribution of their scores follows the much larger Cantonese sample (4083 students).

If anything, since the distribution of math scores for the English-language students differs from that of the larger dataset (it looks bimodal, like their reading values), we can think that language is not a barrier per se.

NOTE: unfortunately the school system in Macao-China "does not have a single centralised set of standards or curriculum. Individual schools follow different educational models, including Chinese, Portuguese, Hong Kong, and British systems." Nonetheless, "the majority of the schools in Macau are grammar schools, which offer language learning, mathematics, science subjects, social studies, etc. to the pupils", so it is reasonable to think that these data do not come from vocational school students (vocational schools there start after age 15 anyhow).

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Math levels vs reading levels: There is a clear positive correlation between test results in reading and test results in mathematics.

  • Math levels by ESCS: If we plot the math levels by the ESCS index, we see the curve starting slightly right-skewed for math level 'below 1', turning normal and then bending the other way, with an ESCS mean that increases as we move up the math levels. By level 6 it is definitely left-skewed.

    There seems to be a correlation here, and it is a bit sad: very bad levels in mathematics can be "achieved" by students from every socioeconomic and cultural background (though, in fact, students at the lowest ESCS level are not the worst achievers). On the other hand, students living in the most disadvantaged conditions are not present among the top mathematics scores.

  • Math values by gender (normalized): Girls seem to score a bit worse, generally: their mean is slightly lower, and their std is narrower.

  • Math values by language type: The students using a logographic language perform better. Looking at the performance levels, their distribution seems left-skewed. Looking at the values, which are more fine-grained, the distributions of each language group appear normal:

    • The logographic group has a higher mean and a larger std (the top of the curve is somewhat flattened).
    • The deep orthography group distribution has a similar shape, but with a lower mean (the lowest of the three).
    • The shallow orthography group is in the middle, with a mean a little above that of the deep orthography languages, and the smallest std.

Data for Macao-China have not changed. The number of tests in English is not large (51); however, Mandarin-speaking students here are not many more (71), and the distribution of their scores follows the much larger Cantonese sample (4083 students).

If anything, since the distribution of math scores for the English-language students differs from that of the larger dataset (it looks bimodal, like their reading values), we can think that language is not a barrier per se.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Contrary to math levels by gender, reading levels by gender suggest that girls do better than boys at this task.

  • reading and ESCS: The relationship between reading and ESCS is similar to math and ESCS.

Multivariate Exploration

In [235]:
# relationship between ESCS and math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.regplot, 'Index_of_economic_social_and_cultural_status', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.01})

# .add_legend() does not work right (the legend colors pick up the alpha value)

import matplotlib

name_to_color = {
    'deep orthography':   sb.color_palette()[0],
    'shallow orthography':   sb.color_palette()[1],
    'logographic': sb.color_palette()[2],
}

patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches);

It looks like the ESCS level matters more in Countries with a deep orthography language. Shallow orthography and logographic languages show a similar correlation between ESCS and math scores, but some other factor places the logographic group higher.

In [236]:
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.regplot, 'Index_of_economic_social_and_cultural_status', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.01})
Out[236]:
<seaborn.axisgrid.FacetGrid at 0x23e72c8b148>
In [237]:
# relationship between reading and math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.01})

# .add_legend() does not work right

import matplotlib

name_to_color = {
    'deep orthography':   sb.color_palette()[0],
    'shallow orthography':   sb.color_palette()[1],
    'logographic': sb.color_palette()[2],
}

patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches);

(all language types in the same graph, because it is easier to compare the slope of the regression lines)

It looks like for the logographic languages the correlation between reading performance and math performance is stronger: for the same performance in reading at the PISA test, students in the logographic group perform better in math than students of the other groups.

Of course, this could very well depend on other causes, like a schooling system that works harder on mathematics.
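The "steeper line" impression can be checked numerically by fitting the regression per group and comparing slopes. A minimal sketch with `np.polyfit` on synthetic data (two hypothetical groups, the second built with a steeper reading-to-math relationship; in the notebook the inputs would be the reading and math columns of each Language_type slice):

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic reading/math scores for two hypothetical groups,
# the second with a steeper reading-to-math relationship
reading = rng.normal(500, 90, size=2_000)
math_a = 100 + 0.8 * reading + rng.normal(0, 40, size=2_000)
math_b = 60 + 1.0 * reading + rng.normal(0, 40, size=2_000)

# np.polyfit returns coefficients highest degree first: [slope, intercept]
slope_a = np.polyfit(reading, math_a, deg=1)[0]
slope_b = np.polyfit(reading, math_b, deg=1)[0]
print(slope_a, slope_b)
```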

In [238]:
# relationship between reading and math values by language type, by country
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', col='Country',
                 col_wrap=3, xlim=(0,1000), ylim=(0,1000), legend_out=True)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.1});

name_to_color = {
    'deep orthography':   sb.color_palette()[0],
    'shallow orthography':   sb.color_palette()[1],
    'logographic': sb.color_palette()[2],
}

patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches, loc=6, bbox_to_anchor=(1,0.5));

There are a few interesting Countries with two language types each:

  • Belgium, Switzerland, Luxembourg : deep and shallow orthography
  • Hong Kong-China, Macao-China : deep orthography and logographic

In [239]:
# focus on those countries
exploration_df1_subset = exploration_df1[exploration_df1.Country.isin(['Belgium', 'Switzerland', 'Luxembourg', 'Hong Kong-China', 'Macao-China'])]
In [240]:
# plot them
g = sb.FacetGrid(data=exploration_df1_subset, hue='Language_type', col='Country',
                 col_wrap=3, xlim=(0,1000), ylim=(0,1000), legend_out=True)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.1});

name_to_color = {
    'deep orthography':   sb.color_palette()[0],
    'shallow orthography':   sb.color_palette()[1],
    'logographic': sb.color_palette()[2],
}

patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches, loc=6, bbox_to_anchor=(1,0.5));
In [241]:
# number of rows each
exploration_df1_subset.groupby('Country').Language_of_the_test.value_counts()
Out[241]:
Country          Language_of_the_test
Belgium          French                  2399
                 German                   494
Hong Kong-China  Cantonese               4157
                 English                   13
Luxembourg       French                   455
                 German                   118
                 English                   83
Macao-China      Chinese                 4281
                 English                   51
Switzerland      German                  4983
                 French                  3444
                 Italian                  304
Name: Language_of_the_test, dtype: int64

I have already seen that Macao has very few English (blue) datapoints. Hong Kong has even fewer (13), so I would not dare comment on its trendline.

In Belgium and Luxembourg there is a convergence of the regression lines. The [Luxembourg](https://omniglot.com/writing/luxembourgish.htm) schooling system is multilingual and students from different language backgrounds study together. The Belgian schooling system is different, as it is divided by linguistic groups, but has a common base of competence levels that must be reached by every student at the end of each cycle. The Swiss system is organized differently: the Federation only defines the compulsory years of education, while each Canton then has the authority to organize the system. The shallow language group here contains both German and Italian. The German Cantons and the Italian-speaking Ticino may be interesting to compare (even if the Italian datapoints are far fewer: 304 vs 4983).

In [242]:
# sample the german and french data for switzerland
np.random.seed(55)
switzerland_df = exploration_df1.query('Country=="Switzerland"')

fractions_to_drop = {'French':.9, 'German':.95}
for lang, frac in fractions_to_drop.items():
    switzerland_df = switzerland_df.drop(switzerland_df.loc[switzerland_df.Language_of_the_test == lang].sample(frac=frac).index)
    
switzerland_df.Language_of_the_test.value_counts()
Out[242]:
French     344
Italian    304
German     249
Name: Language_of_the_test, dtype: int64
In [243]:
# ESCS by language of the test in Switzerland
g = sb.FacetGrid(data=exploration_df1.query('Country=="Switzerland"'),
                 hue='Language_of_the_test', palette='icefire', height=6, aspect=1.5)
g.map(sb.distplot, 'Index_of_economic_social_and_cultural_status',  hist=False, bins=120);
g.add_legend();

The ESCS index distribution is the same across the different linguistic areas.

In [244]:
# ESCS, math, language type
# plot the full data and the sampled ones, to be sure the trendlines stay the same

g1 = sb.FacetGrid(data=exploration_df1.query('Country=="Switzerland"'), hue='Language_of_the_test',
                 palette='icefire', xlim=(-6,5), ylim=(100,800), height=5, aspect=1)
g1.map(sb.regplot, 'Index_of_economic_social_and_cultural_status', 'Plausible_value_in_mathematics', scatter=False)
g1.add_legend()


#g2 = sb.FacetGrid(data=switzerland_df, hue='Language_of_the_test', palette='icefire', height=5, aspect=1)
#g2.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics')
#g2.add_legend()
Out[244]:
<seaborn.axisgrid.FacetGrid at 0x23e1d8812c8>

This graph seems to suggest that the schooling system is more important than the language type. Here we have, in order, a shallow-orthography language (Italian) that, for high values of reading performance, scores worse in mathematics than a deep-orthography language (French), which in turn scores worse than another shallow-orthography language (German).
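The crossing of per-group trendlines that the plot hints at can be quantified by fitting a line per group; a minimal sketch with `np.polyfit` on synthetic reading/math scores (all values invented, not the PISA data):

```python
import numpy as np

rng = np.random.default_rng(0)

# two hypothetical language groups whose true trendlines cross at reading = 500
reading_a = rng.uniform(300, 700, 200)
math_a = 100 + 0.7 * reading_a + rng.normal(0, 10, 200)   # steeper line
reading_b = rng.uniform(300, 700, 200)
math_b = 250 + 0.4 * reading_b + rng.normal(0, 10, 200)   # shallower line

slope_a, icept_a = np.polyfit(reading_a, math_a, 1)
slope_b, icept_b = np.polyfit(reading_b, math_b, 1)

# the fitted lines intersect where slope_a*x + icept_a == slope_b*x + icept_b
x_cross = (icept_b - icept_a) / (slope_a - slope_b)
print(slope_a > slope_b, round(x_cross))
```

Comparing the fitted slopes and the crossing point per country would make the "which group overtakes where" reading of the regplots concrete.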

In [245]:
g2 = sb.FacetGrid(data=exploration_df1_subset.query('Country=="Switzerland"'), hue='Language_of_the_test', col='ESCS_levels',
                  col_wrap=2, palette='icefire', height=5, aspect=1, xlim=(100,900), ylim=(100,900))
g2.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics')
g2.add_legend();

To have a better look at the role of the educational system, since it seems the most relevant factor, we can observe the distributions for the English-speaking Countries.

In [246]:
# get the subset of english speaking Countries
english_df = exploration_df1.query('Language_of_the_test=="English"')
english_df.groupby('Country').Plausible_value_in_mathematics.describe()
Out[246]:
count mean std min 25% 50% 75% max
Country
Australia 12385.0 497.755682 94.624259 115.4463 431.383400 496.42470 563.802800 848.9718
Canada 13382.0 509.237250 84.034731 195.4432 451.324200 508.10870 567.697500 789.2273
Connecticut (USA) 1467.0 510.091446 95.900723 259.5498 439.718000 510.52340 579.381550 773.7264
Florida (USA) 1577.0 469.068658 84.338675 187.1086 410.118400 466.82510 526.725300 727.3797
Hong Kong-China 13.0 472.511292 107.507339 329.0310 407.625800 446.72850 579.848900 652.9911
Ireland 4571.0 502.567121 84.053601 196.1443 446.650600 503.66880 560.258650 754.1751
Luxembourg 83.0 589.320954 69.063842 399.8364 549.431400 589.89720 637.490250 760.5624
Macao-China 51.0 511.609378 126.540083 195.7548 430.916050 496.65840 616.225300 731.7417
Massachusetts (USA) 1466.0 517.407266 92.017642 256.5898 451.109975 514.14550 581.445750 829.1868
New Zealand 3474.0 509.712848 94.182427 188.4328 443.456900 508.03080 576.460475 846.7908
Qatar 722.0 496.017731 93.350339 152.6017 432.415450 500.20255 567.892175 719.2008
Singapore 2312.0 597.111246 97.415075 205.1020 534.203125 606.99485 664.519400 872.6515
United Kingdom 11197.0 493.581955 89.497389 180.8771 430.916000 493.30890 554.767100 817.5027
United States of America 4179.0 487.319961 87.764521 220.6029 424.801350 483.18270 547.639850 768.5076
In [247]:
# have a look at their math values distribution
g = sb.FacetGrid(data=english_df, row='Country', height=1, aspect=3, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
Out[247]:
<seaborn.axisgrid.FacetGrid at 0x23e604e4d48>

I will keep only the Countries with more than 1000 rows, and then trim the ESCS levels (cutting the two lowest and the top one, which are almost empty).

In [248]:
# select Countries with more than 1000 rows
english_count = english_df.Country.value_counts().to_dict().items()

english_df = english_df[english_df.Country.isin([key for key, val in english_count if val > 1000])]
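The count-based selection above can also be written with `groupby().filter`, which keeps only whole groups passing a size test; a minimal sketch on toy data:

```python
import pandas as pd

# toy stand-in: a few Countries with different row counts
df = pd.DataFrame({'Country': ['Australia'] * 1500 + ['Qatar'] * 722 + ['Canada'] * 2000})

# keep only Countries contributing more than 1000 rows
filtered = df.groupby('Country').filter(lambda g: len(g) > 1000)
print(sorted(filtered.Country.unique()))
```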
In [249]:
# get the category values to copy and paste
english_df.ESCS_levels.dtype
Out[249]:
CategoricalDtype(categories=['-5.32 to -4.38', '-4.38 to -3.44', '-3.44 to -2.51',
                  '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31',
                  '0.31 to 1.24', '1.24 to 2.18', '2.18 to 3.12'],
                 ordered=True)
In [250]:
# drop the lines in the two lowest ESCS levels and in the top one
ESCS_drop = ['-5.32 to -4.38', '-4.38 to -3.44', '2.18 to 3.12']
english_df = english_df.drop(english_df[english_df.ESCS_levels.isin(ESCS_drop)].index)
# Modify the category
ESCS_ordered2 = ['-3.44 to -2.51', '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31', '0.31 to 1.24', '1.24 to 2.18']
ESCS_middle_cat = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_ordered2)
english_df.ESCS_levels = english_df.ESCS_levels.astype(ESCS_middle_cat)
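Instead of re-typing the column with a hand-written category list, pandas can drop the now-empty extreme bins directly with `Series.cat.remove_unused_categories()`, which preserves the remaining order. A minimal sketch on a toy ordered categorical:

```python
import pandas as pd

levels = ['lowest', 'low', 'mid', 'high', 'top']
s = pd.Series(pd.Categorical(['low', 'mid', 'mid', 'high'],
                             categories=levels, ordered=True))

s = s[~s.isin(['low'])]                 # drop the rows in an extreme bin
s = s.cat.remove_unused_categories()    # 'lowest', 'low' and 'top' disappear
print(list(s.cat.categories), s.cat.ordered)
```

The explicit `CategoricalDtype` used in the notebook is still handy when the same trimmed levels have to be reapplied to another frame, as done for the logographic subset below.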
In [251]:
# plot the math scores by Country divided by ESCS level
g = sb.FacetGrid(data=english_df, col='ESCS_levels', row='Country', height=1, aspect=1.8, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
g.add_legend();
C:\Users\annap\Anaconda3\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
C:\Users\annap\Anaconda3\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
C:\Users\annap\Anaconda3\lib\site-packages\seaborn\axisgrid.py:848: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
  self.fig.tight_layout()
In [252]:
# plot the math scores by Country divided by ESCS level (English speaking countries)

plt.figure(figsize=(15,12))
g = sb.pointplot(data=english_df, x="ESCS_levels", y="Plausible_value_in_mathematics",
                 hue="Country", dodge=.3, palette='bright')
In [253]:
# collect the logographic countries
logographic_df = exploration_df1.query('Language_type=="logographic"')
logographic_df.groupby('Country').Language_of_the_test.value_counts()
Out[253]:
Country          Language_of_the_test
China-Shanghai   Shanghai dialect        5054
Chinese Taipei   Mandarin                4695
Hong Kong-China  Cantonese               4157
Japan            Japanese                6059
Korea            Korean                  4954
Macao-China      Chinese                 4281
Name: Language_of_the_test, dtype: int64
In [254]:
# drop the lines in the two lowest ESCS levels and in the top one
# list is ESCS_drop = ['-5.32 to -4.38', '-4.38 to -3.44', '2.18 to 3.12']
logographic_df = logographic_df.drop(logographic_df[logographic_df.ESCS_levels.isin(ESCS_drop)].index)
# Modify the category
# cat is ESCS_ordered2 = ['-3.44 to -2.51', '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31', '0.31 to 1.24', '1.24 to 2.18']
# ESCS_middle_cat = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_ordered2)
logographic_df.ESCS_levels = logographic_df.ESCS_levels.astype(ESCS_middle_cat)
In [255]:
# plot the math scores by logographic language (= Country) divided by ESCS level

plt.figure(figsize=(10,8))
g = sb.pointplot(data=logographic_df, x="ESCS_levels", y="Plausible_value_in_mathematics", hue="Country")
In [256]:
# plot the math scores by Country divided by ESCS level
g = sb.FacetGrid(data=logographic_df, col='ESCS_levels', row='Country', height=1, aspect=2, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
g.add_legend();
In [257]:
# just to have them all

g = sb.FacetGrid(data=exploration_df1, hue="Country", col='Language_type', palette='bright', height=6, aspect=1.5)
g.map(sb.pointplot, "ESCS_levels", "Plausible_value_in_mathematics",
                 dodge=.3)
C:\Users\annap\Anaconda3\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)
Out[257]:
<seaborn.axisgrid.FacetGrid at 0x23e7c3331c8>
In [258]:
# just to have them all

sb.catplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics',
           hue='Language_type', kind='point');

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • relationship between ESCS and math values by language type: it looks like the ESCS level is more important in Countries with a deep-orthography language. Shallow-orthography and logographic languages show a similar correlation between ESCS and math scores, but some other factor places the logographic group higher.

  • relationship between reading and math values by language type, by country: there are a few interesting Countries with two language types each. Belgium, Switzerland and Luxembourg combine deep and shallow orthography; Hong Kong-China and Macao-China combine deep orthography and logographic. Macao has very few English (blue) datapoints, and Hong Kong has even fewer (13), so I would not dare to comment on those trendlines.

    In Belgium and Luxembourg there is a convergence of the regression lines. [Luxembourg's](https://omniglot.com/writing/luxembourgish.htm) schooling system is multilingual and students from different language backgrounds study together. Belgium's schooling system is different, as it is divided by linguistic groups, but has a common base of competence levels that every student has to reach at the end of each cycle.

    The Swiss system is organized differently: the Federation only defines the compulsory years of education, while each Canton then has the faculty of organizing its own system. The shallow-orthography language group here contains both German and Italian, so the German Cantons and the Italian-speaking Ticino may be interesting to compare (even if there are far fewer Italian datapoints: 304 vs 4983).

Were there any interesting or surprising interactions between features?

Focus on Switzerland:

  • ESCS by language type in Switzerland: the distributions are essentially the same across linguistic areas
  • reading, math, language type: this graph seems to suggest that the schooling system is more important than the language type. Here we have, in order, a shallow-orthography language (Italian) that, for high values of reading performance, scores worse in mathematics than a deep-orthography language (French), which in turn scores worse than another shallow-orthography language (German).

Focus on English:

  • math scores by Country divided by ESCS level (English speaking countries)
  • math scores by ESCS for the logographic-language Countries: comparing the English-speaking Countries and Singapore by ESCS makes clear that the same language can lead to very different results. Since Singapore is the only Asian country in my English-speaking selection, this also suggests a role of cultural background and/or multilingualism.
In [259]:
# save the dataset needed in the slideshow

exploration_df1.to_csv('df_slideshow1.csv', index=False, encoding='utf-8')
In [ ]: